Abstract


Many signal processing problems may be posed as statistical parameter estimation
problems. A desired solution for the statistical problem is obtained by maximizing the
likelihood (ML), the a posteriori probability (MAP), or some other criterion,
depending on the a priori knowledge. However, in many practical situations, the original
signal processing problem may generate a complicated optimization problem, e.g. when the
observed signals are noisy and “incomplete”.

A framework of iterative procedures for maximizing the likelihood, the EM algorithm, is
widely used in statistics. In the EM algorithm, the observations are considered “incomplete”
and the algorithm iterates between estimating the sufficient statistics of the “complete data”
given the observations and a current estimate of the parameters (the E step) and maximizing
the likelihood of the complete data, using the estimated sufficient statistics (the M step).
When this algorithm is applied to signal processing problems it yields, in many cases, an
intuitively appealing processing scheme.

In the first part of the thesis we investigate and extend the EM framework. By changing
the “complete data” in each step of the algorithm we achieve algorithms with better
convergence properties. We suggest EM type algorithms to optimize other (non ML) criteria.
We also develop sequential and adaptive versions of the EM algorithm.

In the second part of the thesis we discuss some applications of this extended framework
of algorithms. We consider:

• Parameter estimation of composite signals, i.e. signals that can be represented as a
decomposition of simpler signals. This problem appears in, e.g.,

- Multiple source location (or bearing) estimation

- Multipath or multi-echo time delay estimation

• Noise canceling in a multiple microphone environment, for a speech enhancement problem.
Chapter 1

Introduction 

1.1  Introductory  remarks 

Many signal processing problems may be posed as statistical estimation problems. A
celebrated example is the work of Wiener, who formulated the fundamental problem of
filtering a signal from additive noise as a statistical problem, whose solution is now known
as the “Wiener filter”. Other common examples involve parameter estimation, e.g. finding
the location and the velocity of targets in radar/sonar environments, or synchronization
(i.e. timing estimation) problems in communications systems. Many examples of the
statistical analysis of signal processing problems may be found in [1], especially in its second
and third parts.

In  order  to  formulate  the  statistical  problem,  a  model  assumption  is  needed.  A  specific 
model  may  generate  a  simple  statistical  problem.  However,  it  may  not  represent  the  original 
signal processing problem well. On the other hand, another model that tries to consider too
many aspects of the original problem may generate not only a difficult statistical problem,
but also a non-robust, possibly ill-posed problem. The art of good modeling, which captures
the important aspects of the real problem without complicating the resulting mathematical
or  statistical  problem,  is  probably  the  most  important  factor  in  a  successfully  implemented 
statistical  solution  to  the  underlying  real  problem. 

After the statistical problem is formulated, its desired solution often requires the
optimization of some criterion, depending on the a priori knowledge and on the (possibly
subjective) “risk” criterion. Frequently used criteria are Maximum Likelihood (ML) and
Maximum A Posteriori (MAP). Even with good modeling, these optimization problems may
be complicated, e.g. when the observed signals are noisy and incomplete. These optimization
problems are rarely solved analytically. Instead, standard iterative search methods, e.g.
gradient methods or the Newton-Raphson method, are often used. The standard methods have
some well known numerical problems. Furthermore, these methods may still be complicated
since they require the calculation of the gradient and sometimes the Hessian matrix. These
standard search methods rarely generate intuitive algorithms for the original real problem.

An interesting alternative to the straightforward gradient or Newton methods has been
introduced in [2]. This technique, known as the Estimate-Maximize (EM) algorithm,
suggests an iterative algorithm that exploits the properties of the stochastic system under
consideration. The EM algorithm is actually a framework of iterative algorithms. To
implement an EM algorithm, one has to consider the observations as incomplete with respect
to a more convenient choice of complete data. The algorithm then iterates between estimating
the sufficient statistics of the complete data, given the observations and a current estimate
of the parameters (the E step), and maximizing the likelihood of the complete data using
the estimated sufficient statistics (the M step).

As will become evident in the course of this thesis, the EM method may yield intuitive
processing schemes for the original signal processing problem, by innovatively choosing the
complete data. Therefore, it is not surprising that some previously proposed algorithms
for solving various signal processing problems can be interpreted in the EM algorithm
context. One example is the iterative speech enhancement method suggested by Lim and
Oppenheim [3]. We will return to this example later in the thesis. Other algorithms
that have been suggested intuitively to solve specific signal processing problems, e.g. the
iterative channel estimation algorithm of [4], the iterative reconstruction algorithm of [5],
the iterative resolution technique of [6] and more, can also be interpreted as examples of
the EM algorithm.

It is particularly important to note, at this point, the work of Musicus [7] and [8]. In
this  work,  a  general  class  of  iterative  algorithms  has  been  suggested  to  minimize  a  special 
form  of  the  Relative  Entropy.  In  some  special  cases,  the  minimum  relative  entropy  criterion 
reduces  to  the  maximum  likelihood  criterion,  and  in  those  cases,  the  suggested  iterative 
algorithms  reduce  to  the  EM  algorithm.  This  work  was  an  important  inspiration  for  this 
thesis.  We  will  discuss  this  approach  later  in  the  thesis  in  conjunction  with  our  work  on 
general  information  criteria  and  the  EM  algorithm. 

Our exposure to a variety of estimation problems in oceanography, specifically in underwater
acoustics, also had a major impact on this thesis. We have found that many of these
problems were approached suboptimally, probably since the standard mathematical models
of these problems usually generate statistical problems whose direct solution is complicated.
We have suggested the EM iterative algorithm as a better approach to solve these statistical
problems. Later in the thesis, we will describe how modeling considerations and EM
algorithms have been applied to array processing and time delay estimation problems in
underwater acoustics, and have generated interesting solution procedures. The important
experience with oceanographic signal processing problems established and confirmed our
approach for solving statistical signal processing problems in general, and the other
problems, presented later in the thesis, in particular.

In summary, this thesis presents a class of iterative and adaptive algorithms, based
on the ideas that led to the EM algorithm, to optimize various statistical criteria. In
addition, the thesis will address several signal processing problems and show that by using
a reasonable model, an appropriate statistical criterion and an EM algorithm, an insightful
solution procedure may be achieved and implemented successfully.


1.2  Preview  and  organization  of  the  thesis 

The application of the EM algorithm to a real world problem first requires modeling the
problem statistically and then applying the EM algorithm to solve the resulting statistical
problem. However, the EM algorithm is not uniquely defined: it depends on the choice of
complete data. An unfortunate choice may yield a completely useless algorithm.

In  this  thesis  we  will  consider  the  following  signal  processing  problems: 

• Parameter estimation of superimposed signals, i.e. signals that can be represented as
a sum of simpler signals. We will consider specifically the problems of multiple source
location (or bearing) estimation, multipath or multi-echo time delay estimation and
spectral estimation.

• Noise canceling in a multiple microphone environment. The real world application is
a speech enhancement problem.

• Signal reconstruction from partial information. For this problem, we will present ideas
and propose further research.

We  will  suggest  statistical  models  for  these  signal  processing  problems  and  solve  the  resulting 
statistical  problems  by  the  EM  method.  In  all  these  problems  we  will  use  a  natural  choice 
of  complete  data. 

In  the  process  of  considering  the  applications  mentioned  above,  we  have  modified  and 
extended  the  scope  of  the  EM  method  and  derived  explicit  forms  for  some  important  special 
cases.  Each  of  these  results  may  be  considered  as  a  contribution  to  the  EM  algorithm  at  a 
theoretical level. We have also developed and analyzed sequential and adaptive algorithms
based  on  the  EM  algorithm. 

As  a  result  of  these  contributions  a  general  and  flexible  class  of  iterative  and  adaptive 
estimation algorithms is established. Beyond the theoretical contributions and the
specific applications, we believe that this thesis suggests a way of thinking and a philosophy
which  may  be  used  in  a  large  variety  of  seemingly  complex  statistical  inference  and  signal 
processing  problems. 

This thesis is organized as follows. Chapters 2 and 3 provide the theoretical background
and contributions. In Chapter 2, we start with a review of the EM algorithm as developed
in [2], and give its basic convergence properties following the considerations in [9]. We then
derive  the  EM  algorithm  for  the  linear  Gaussian  case,  whose  importance  will  be  evident 
later  in  the  thesis.  We  also  modify  the  basic  EM  algorithm  and  extend  it,  so  that  it  may 
be  applied  to  general  estimation  criteria. 

Any  iterative  algorithm  implies  an  adaptive  or  sequential  estimation  procedure,  in  which 
the new iteration takes into account new data points. A derivation of a class of sequential
algorithms based on the EM structure is presented in Chapter 3. This class of algorithms
may have the important tracking capabilities typical of adaptive algorithms together
with  desirable  asymptotic  convergence  results,  achieved  from  the  EM  theory. 

The  signal  processing  applications  of  this  class  of  iterative  and  adaptive  algorithms  are 
presented  in  Chapters  4  and  5.  In  Chapter  4  several  problems  that  arise  in  radar/sonar 
signal and array processing are presented. Those problems involve multiple targets and
multipath  signals.  A  more  general  problem  is  the  estimation  of  parameters  of  superimposed 
signals.  We  will  describe  an  EM  solution  to  the  general  superimposed  signals  problem,  and 
apply  it  to  multiple  target  bearing  estimation  and  to  multipath  time  delay  estimation. 
Sequential  algorithms  to  solve  this  problem  will  also  be  suggested. 

The  problem  of  multiple  microphone  noise  cancellation  is  presented  in  Chapter  5.  Using 
models  of  the  speech  and  the  noise,  a  statistical  problem  is  formulated  and  then  solved  using 
the EM algorithm. This solution generates an intuitive processing scheme that provides a
novel solution to this well-investigated problem. An adaptive scheme based on the above
algorithm may be an alternative to Widrow's algorithm [10].

Chapter  6  is  entitled  “Information,  Relative  Entropy  and  the  EM  algorithm”.  It  presents 
several  interesting  results  that  give  an  alternate  interpretation  to  the  EM  algorithm  and  to 
information  criteria  mentioned  in  the  EM  algorithm  context. 

Chapter 7 will conclude and summarize the thesis. We will also suggest in this chapter
topics for further research. As one of these topics, we will present specific ideas for solving
problems of signal reconstruction from partial information. A statistical framework for these
problems is developed, and EM algorithms for solving this statistical problem by optimizing
the likelihood, the a posteriori probability or other appropriate criteria are derived.
Chapter 2


The  EM  method:  Review  and  new 
developments 


In  this  Chapter  we  review  the  Estimate-Maximize  (EM)  algorithm  for  solving  maximum 
likelihood  (ML)  and  maximum-a-posteriori  (MAP)  estimation  problems,  and  present  new 
developments  that  extend  the  scope  of  the  algorithm,  and  make  it  more  accessible  for  solving 
signal  processing  problems. 

The chapter is organized as follows. In section 2.1, the basic EM algorithm is presented,
following the considerations in [2]. In section 2.2, we analyze and discuss the convergence
properties of the EM algorithm. The results presented here clarify and simplify the
convergence analysis presented in [2] and [9].

In  section  2.3,  the  EM  algorithm  is  explicitly  derived  for  the  special  but  important 
case  where  the  observed  (incomplete)  data  and  complete  data  are  jointly  Gaussian,  related 
by  a  linear  non-invertible  transformation.  In  perspective,  the  linear-Gaussian  case  was  an 
important step towards the application of the EM algorithm to signal processing problems.

In sections 2.4 and 2.5, we present new ideas and results that extend the scope of the
EM method. The results in these sections generate a more general, yet more flexible, class
of iterative algorithms.

Section 2.6 concludes this chapter by discussing the possible signal processing
applications of the EM framework.


2.1  Basic  theory  of  the  EM  algorithm 

Let $Y$ denote a data vector with the associated probability density $f_Y(y; \theta)$, indexed
by the parameter vector $\theta \in \Theta$, where $\Theta$ is a subset of the k-dimensional Euclidean space.
Given an observed $y$, the maximum likelihood (ML) estimate, $\hat{\theta}_{ML}$, is the value of $\theta$ that
maximizes the log-likelihood, that is,

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} \log f_Y(y; \theta) \tag{2.1}$$

Finding the ML estimator is often desirable since it is, in most cases, asymptotically
consistent and efficient. However, in many cases, the maximization problem of (2.1) is
complicated.

Suppose that the data vector $Y$ can be viewed as being incomplete, and we can specify
some data $X$ related to $Y$ by

$$T(X) = Y \tag{2.2}$$

where $T(\cdot)$ is a non-invertible (many to one) transformation. If an observation $x$ of $X$ is
given, an observation $y$ of $Y$ is available too, but not vice versa. $X$ will be referred to as
the complete data. The probability density of the complete data, denoted $f_X(x; \theta)$, is also
indexed by the parameter vector $\theta$. Assume that $X$ is specified so that if $x$ is available,
finding the maximum likelihood estimate of $\theta$ is easy, i.e. solving

$$\hat{\theta} = \arg\max_{\theta} \log f_X(x; \theta) \tag{2.3}$$

is straightforward. The EM algorithm, presented below, will use the simple procedure for
ML estimation in the complete data model as a part of an iterative algorithm for ML
estimation in the observations' model.

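Given a sample of the incomplete data $y$, the complete data $x$ must be a member of the
set $\mathcal{X}(y)$ where

$$\mathcal{X}(y) = \{x \mid T(x) = y\} \tag{2.4}$$

Since $Y$ is a many to one function of the complete data $X$, the probability density
functions of the complete and incomplete data satisfy

$$f_Y(y; \theta) = \int_{\mathcal{X}(y)} f_X(x; \theta)\, dx \tag{2.5}$$

The conditional density of $X$ given $Y = y$ is defined over the set $\mathcal{X}(y)$. This probability
density function is given by

$$f_{X|Y=y}(x; \theta) = \frac{f_X(x; \theta)}{\int_{\mathcal{X}(y)} f_X(x; \theta)\, dx} = \frac{f_X(x; \theta)}{f_Y(y; \theta)} \tag{2.6}$$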

Taking the logarithm on both sides of (2.6) and rearranging, we obtain

$$\log f_Y(y; \theta) = \log f_X(x; \theta) - \log f_{X|Y=y}(x; \theta) \tag{2.7}$$

We can now take the conditional expectation over $X$ of both sides of (2.7), given $Y = y$
and an arbitrary parameter value $\theta'$. The left hand side remains unchanged, and we get

$$\log f_Y(y; \theta) = E\{\log f_X(x; \theta) \mid Y = y; \theta'\} - E\{\log f_{X|Y=y}(x; \theta) \mid Y = y; \theta'\} \tag{2.8}$$

Define, for convenience,

$$L(\theta) = \log f_Y(y; \theta) \tag{2.9}$$

$$Q(\theta; \theta') = E\{\log f_X(x; \theta) \mid Y = y; \theta'\} = \int_{\mathcal{X}(y)} \log f_X(x; \theta) \cdot f_{X|Y=y}(x; \theta')\, dx \tag{2.10}$$

$$H(\theta; \theta') = E\{\log f_{X|Y=y}(x; \theta) \mid Y = y; \theta'\} = \int_{\mathcal{X}(y)} \log f_{X|Y=y}(x; \theta) \cdot f_{X|Y=y}(x; \theta')\, dx \tag{2.11}$$

With these definitions, equation (2.8) reads

$$L(\theta) = Q(\theta; \theta') - H(\theta; \theta') \tag{2.12}$$

Equations (2.7) and (2.8) are interesting identities for $L(\theta)$, the log-likelihood of the
observations. Equation (2.7) is true for any $x \in \mathcal{X}(y)$. Equation (2.8), or equivalently
equation (2.12), is true for any pair $\theta, \theta' \in \Theta$.
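
The identity (2.12) can be checked numerically on a small discrete model, where the integrals over $\mathcal{X}(y)$ become finite sums. The sketch below is only an illustration under assumptions that are not part of the thesis: the complete data is a single Binomial(3, θ) count, and the hypothetical many-to-one map `T` observes only its parity. All function names are chosen for this example.

```python
import math

def f_x(x, theta):
    """Complete-data pmf: X ~ Binomial(3, theta), x in {0, 1, 2, 3} (toy assumption)."""
    return math.comb(3, x) * theta ** x * (1.0 - theta) ** (3 - x)

def T(x):
    """Hypothetical many-to-one map: only the parity of x is observed."""
    return x % 2

def fiber(y):
    """The set X(y) = {x : T(x) = y}, the discrete analogue of (2.4)."""
    return [x for x in range(4) if T(x) == y]

def f_y(y, theta):
    """Observed-data pmf, the discrete analogue of (2.5)."""
    return sum(f_x(x, theta) for x in fiber(y))

def f_cond(x, y, theta):
    """Conditional pmf of X given Y = y, the analogue of (2.6)."""
    return f_x(x, theta) / f_y(y, theta)

def L(y, theta):
    return math.log(f_y(y, theta))

def Q(y, theta, theta_p):
    return sum(math.log(f_x(x, theta)) * f_cond(x, y, theta_p) for x in fiber(y))

def H(y, theta, theta_p):
    return sum(math.log(f_cond(x, y, theta)) * f_cond(x, y, theta_p) for x in fiber(y))

def identity_gap(y, theta, theta_p):
    """Absolute difference between the two sides of (2.12)."""
    return abs(L(y, theta) - (Q(y, theta, theta_p) - H(y, theta, theta_p)))

grid = [0.1, 0.3, 0.5, 0.7, 0.9]
max_gap = max(identity_gap(y, t, tp) for y in (0, 1) for t in grid for tp in grid)
print(max_gap)  # differs from zero only by floating point rounding
```

The check runs over every pair of grid values, which reflects the statement that the identity holds for any pair of parameter values and not only for matched ones.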

Consider now Jensen's inequality (see e.g. equations 1e.5.6 and 1e.6.6 in [11]) which
states that for any two p.d.f.'s $f$ and $g$ defined over the same sample space,

$$E_f\{\log g\} \le E_f\{\log f\} \tag{2.13}$$

where equality holds if and only if $f = g$ almost everywhere. $E_f$ denotes the expectation
using the p.d.f. $f$. Let $f = f_{X|Y=y}(x; \theta')$ and $g = f_{X|Y=y}(x; \theta)$, both defined over the sample
space $\mathcal{X}(y)$. Substituting in (2.13), and using the definition of $H(\cdot\,; \cdot)$, we get

$$H(\theta; \theta') \le H(\theta'; \theta') \tag{2.14}$$

Suppose we can find $\theta$ such that

$$Q(\theta; \theta') \ge Q(\theta'; \theta') \tag{2.15}$$

In this case, using (2.14) and (2.12), we conclude that

$$L(\theta) \ge L(\theta') \tag{2.16}$$
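
The step from (2.12), (2.14) and (2.15) to (2.16) may be written out explicitly. Evaluating (2.12) once at $\theta$ and once at $\theta'$, with the same conditioning value $\theta'$ in both cases, and subtracting, gives

$$L(\theta) - L(\theta') = \underbrace{\left[ Q(\theta; \theta') - Q(\theta'; \theta') \right]}_{\ge 0 \text{ by } (2.15)} + \underbrace{\left[ H(\theta'; \theta') - H(\theta; \theta') \right]}_{\ge 0 \text{ by } (2.14)} \ge 0$$

so any $\theta$ that does not decrease $Q(\cdot\,; \theta')$ cannot decrease the likelihood. Here $\theta'$ denotes the arbitrary parameter value at which the conditional expectations in (2.10) and (2.11) are taken.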
A procedure for iteratively increasing the likelihood may be suggested based on (2.15)
and (2.16) as follows. Given a value of the parameter $\theta^{(n)}$, we will find a new value $\theta^{(n+1)}$
that satisfies (2.15), and thus increases the likelihood, by maximizing $Q(\theta; \theta^{(n)})$. This
procedure is the EM algorithm, which we now formally present.

• Start, n = 0: guess $\theta^{(0)}$

• Iterate (until some convergence criterion is achieved)

- The E step: calculate

$$Q(\theta; \theta^{(n)}) = E\{\log f_X(x; \theta) \mid Y = y; \theta^{(n)}\} \tag{2.17}$$

- The M step: solve

$$\theta^{(n+1)} = \arg\max_{\theta} Q(\theta; \theta^{(n)}) \tag{2.18}$$

- n = n + 1

The “E” stands for the conditional Expectation, or Estimation, performed in the E step,
and the “M” stands for the Maximization performed in the M step.
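
The E and M steps above can be sketched on a standard textbook problem that is not one of the applications treated in this thesis: a mixture of two unit-variance Gaussians with unknown means and mixing weight. The complete data is assumed to be each sample together with its (unobserved) component label, and the many-to-one map simply drops the labels. Function names, the seed, and the starting point are illustrative choices.

```python
import math
import random

def log_likelihood(y, w, mu1, mu2):
    """Observed-data log-likelihood L(theta) of the two-component mixture."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return sum(
        math.log(w * c * math.exp(-0.5 * (v - mu1) ** 2)
                 + (1.0 - w) * c * math.exp(-0.5 * (v - mu2) ** 2))
        for v in y
    )

def em_step(y, w, mu1, mu2):
    """One EM iteration: theta_n -> theta_{n+1}."""
    # E step: posterior probability that each sample came from component 1,
    # which is all that Q(theta; theta_n) depends on in this model.
    r = []
    for v in y:
        p1 = w * math.exp(-0.5 * (v - mu1) ** 2)
        p2 = (1.0 - w) * math.exp(-0.5 * (v - mu2) ** 2)
        r.append(p1 / (p1 + p2))
    # M step: closed-form maximizer of Q (weighted sample averages).
    s1 = sum(r)
    s2 = len(y) - s1
    w_new = s1 / len(y)
    mu1_new = sum(ri * v for ri, v in zip(r, y)) / s1
    mu2_new = sum((1.0 - ri) * v for ri, v in zip(r, y)) / s2
    return w_new, mu1_new, mu2_new

def run_em(y, theta0, n_iter=50):
    """Iterate EM from theta0 and record the log-likelihood after every step."""
    theta = theta0
    history = [log_likelihood(y, *theta)]
    for _ in range(n_iter):
        theta = em_step(y, *theta)
        history.append(log_likelihood(y, *theta))
    return theta, history

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(150)] + [rng.gauss(3.0, 1.0) for _ in range(150)]
theta_hat, hist = run_em(data, (0.5, -1.0, 1.0))
print(theta_hat)
```

The recorded `hist` sequence should never decrease, which is the monotonicity property the derivation above leads to. A fixed iteration count is used here in place of a convergence criterion only to keep the sketch short.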

An EM iteration may be summarized by the updating equation

$$\theta^{(n+1)} = \arg\max_{\theta} E\{\log f_X(x; \theta) \mid Y = y; \theta^{(n)}\} \tag{2.19}$$

This iteration is justified intuitively as follows. We would like to choose $\theta$ that maximizes
$\log f_X(x; \theta)$, the log-likelihood of the complete data. However, since $\log f_X(x; \theta)$ is not
available to us (because the complete data is not available), we maximize instead its expectation,
given the observed data $y$ and the current value of the parameters $\theta^{(n)}$.

An iterative procedure that increases the likelihood is also achieved if, instead of
maximizing $Q(\theta; \theta^{(n)})$, we just increase it. Thus, we may replace the M step by the following
step:

$$\theta^{(n+1)} = M(\theta^{(n)}) \tag{2.20}$$

where $M(\theta)$ is any mapping that satisfies

$$Q(M(\theta); \theta) \ge Q(\theta; \theta) \quad \forall \theta \in \Theta \tag{2.21}$$

This variation of the algorithm was named the Generalized EM algorithm (GEM) by
Dempster et al. [2]. A special case, of course, is the EM algorithm.
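
One simple way to obtain a mapping that increases $Q$ without maximizing it is to move only part of the way toward the maximizer. The sketch below illustrates this on an assumed toy model (an equal-weight mixture of two unit-variance Gaussians with unknown means); the damping factor `eta` and all function names are hypothetical and are not taken from the thesis. Because $Q$ is a concave quadratic in each mean for this model, any step with 0 < eta <= 1 cannot decrease it.

```python
import math
import random

def responsibilities(y, mu1, mu2):
    """E-step quantities: P(sample came from component 1 | y_i, current means)."""
    out = []
    for v in y:
        p1 = math.exp(-0.5 * (v - mu1) ** 2)
        p2 = math.exp(-0.5 * (v - mu2) ** 2)
        out.append(p1 / (p1 + p2))
    return out

def q_value(y, r, mu1, mu2):
    """Q(theta; theta_n) up to an additive constant that does not depend on theta."""
    return sum(-0.5 * ri * (v - mu1) ** 2 - 0.5 * (1.0 - ri) * (v - mu2) ** 2
               for ri, v in zip(r, y))

def log_likelihood(y, mu1, mu2):
    """Observed-data log-likelihood of the equal-weight mixture."""
    c = 0.5 / math.sqrt(2.0 * math.pi)
    return sum(math.log(c * math.exp(-0.5 * (v - mu1) ** 2)
                        + c * math.exp(-0.5 * (v - mu2) ** 2)) for v in y)

def gem_step(y, mu1, mu2, eta=0.3):
    """Damped M step: move a fraction eta of the way toward the maximizer of Q."""
    r = responsibilities(y, mu1, mu2)
    full1 = sum(ri * v for ri, v in zip(r, y)) / sum(r)
    full2 = sum((1.0 - ri) * v for ri, v in zip(r, y)) / sum(1.0 - ri for ri in r)
    return mu1 + eta * (full1 - mu1), mu2 + eta * (full2 - mu2)

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
theta = (-0.5, 0.5)
gem_history = [log_likelihood(data, *theta)]
for _ in range(100):
    theta = gem_step(data, *theta)
    gem_history.append(log_likelihood(data, *theta))
print(theta)
```

Setting `eta=1.0` recovers the full M step, so the EM algorithm appears here as the special case mentioned in the text.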

The  motivation  of  the  GEM  algorithm,  which  also  applies  to  the  EM  algorithm,  is 
summarized in the following theorem (theorem 1 of [2]). This theorem carefully states the
basic  monotonicity  property  of  the  GEM  algorithm.  The  proof  of  this  theorem  will  follow 
immediately  from  the  considerations  above. 

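Theorem 2.1 For every GEM algorithm,

$$L(\theta^{(n+1)}) \ge L(\theta^{(n)}) \tag{2.22}$$

where equality holds if and only if both

$$Q(\theta^{(n+1)}; \theta^{(n)}) = Q(\theta^{(n)}; \theta^{(n)}) \tag{2.23}$$

and

$$f_{X|Y}(x|y; \theta^{(n+1)}) = f_{X|Y}(x|y; \theta^{(n)}) \quad \text{a.e. in } \mathcal{X}(y) \tag{2.24}$$

Proof By the definition of the GEM algorithm,

$$Q(\theta^{(n+1)}; \theta^{(n)}) \ge Q(\theta^{(n)}; \theta^{(n)})$$

Thus, since $H(\theta^{(n+1)}; \theta^{(n)}) \le H(\theta^{(n)}; \theta^{(n)})$,

$$L(\theta^{(n+1)}) \ge L(\theta^{(n)})$$

Now by (2.12), equality holds if and only if

$$Q(\theta^{(n+1)}; \theta^{(n)}) = Q(\theta^{(n)}; \theta^{(n)}) \quad \text{and} \quad H(\theta^{(n+1)}; \theta^{(n)}) = H(\theta^{(n)}; \theta^{(n)})$$

The latter holds if and only if

$$f_{X|Y}(x|y; \theta^{(n+1)}) = f_{X|Y}(x|y; \theta^{(n)}), \quad \text{a.e. in } \mathcal{X}(y)$$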

This  theorem  leads  to  the  following  corollaries: 

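Corollary 2.1 If for some $\theta^* \in \Theta$, $L(\theta^*) > L(\theta)$, $\forall \theta \in \Theta$, $\theta \ne \theta^*$, then for every GEM algorithm

$$M(\theta^*) = \theta^*$$

Corollary 2.2 Suppose for some $\theta^* \in \Theta$, $L(\theta^*) \ge L(\theta)$, $\forall \theta \in \Theta$. Then, for every GEM
algorithm

$$L(M(\theta^*)) = L(\theta^*)$$

$$Q(M(\theta^*); \theta^*) = Q(\theta^*; \theta^*)$$

$$f_{X|Y}(x|y; M(\theta^*)) = f_{X|Y}(x|y; \theta^*) \quad \text{a.e. in } \mathcal{X}(y)$$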

In  other  words,  if  a  unique  global  maximum  of  the  likelihood  exists,  it  is  a  fixed  point 
of  any  GEM  algorithm.  If  we  have  a  set  of  global  maxima,  the  GEM  algorithm  may  move 
inside  this  set.  However,  each  new  value  must  satisfy  the  conditions  of  corollary  2.2. 

We  note  that  the  EM  algorithm  is  actually  a  class  of  algorithms.  There  are  many 
complete  data  specifications  X,  that  will  generate  the  observed  data  Y.  The  choice  of 
complete  data  may  critically  affect  the  complexity  and  the  convergence  properties  of  the 
algorithm.  An  unfortunate  choice  of  complete  data  will  yield  a  completely  useless  algorithm. 
Thus,  it  takes  creativity  to  apply  the  EM  algorithm  to  a  given  problem.  This  will  be 
demonstrated  later  in  the  thesis  when  we  solve  specific  signal  processing  problems. 

To  complete  the  basic  theory  of  the  EM  algorithm,  we  will  present  in  section  2.2  the 
convergence  properties  of  the  algorithm.  This  presentation,  which  clarifies  and  simplifies 
previous  results,  may  be  used  as  a  future  reference  on  these  topics. 
The EM algorithm for exponential families


Examining  the  expressions  for  the  EM  algorithm,  (2.17)  and  (2.18),  we  note  that,  in  general, 
the EM algorithm may be complicated. The calculation of $Q(\theta; \theta^{(n)})$ in the E step may
require  multiple  integration,  and  the  maximization  in  the  M  step  is,  in  general,  a  non-linear 
optimization  problem.  However,  in  the  case  of  exponential  families  of  distributions,  which 
is  now  described,  the  E  step  has  an  explicit  simple  form  and  the  maximization  performed 
in  the  M  step  is  as  complicated  as  solving  a  maximum  likelihood  problem  for  the  complete 
data,  which  is  assumed  to  be  easy. 

Suppose that the p.d.f. of the complete data, $x$, belongs to the exponential family of
probabilities, i.e.

$$f_X(x; \theta) = \frac{b(x)}{a(\theta)} \exp\Big[ \sum_i \alpha_i(\theta)\, t_i(x) \Big] \tag{2.25}$$

The set of statistics $\{t_i(x)\}$ forms the sufficient statistics. This set is denoted $\mathbf{t}(x)$. Note that
the exponential family of distributions includes almost all common p.d.f.'s, e.g. Gaussian,
binomial, exponential etc.

The log-likelihood of the complete data for exponential families has the form

$$\log f_X(x; \theta) = -\log a(\theta) + \sum_i \alpha_i(\theta)\, t_i(x) + \underbrace{\log b(x)}_{\text{independent of } \theta} \tag{2.26}$$

Due to this special form of the log-likelihood, we need only calculate, in the E step,
the conditional expectation of the sufficient statistics. We then substitute the estimated
sufficient statistics in the likelihood of the complete data, and maximize the resulting
expression in the M step. The E and M steps of the EM algorithm, for exponential families,
thus reduce to




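• The E step: Calculate

$$\mathbf{t}^{(n)} = E\{\mathbf{t}(x) \mid y; \theta^{(n)}\}$$

or, $\forall i$, calculate

$$t_i^{(n)} = E\{t_i(x) \mid y; \theta^{(n)}\} \tag{2.27}$$

• The M step: solve

$$\theta^{(n+1)} = \arg\max_{\theta} \Big[ -\log a(\theta) + \sum_i \alpha_i(\theta)\, t_i^{(n)} \Big] \tag{2.28}$$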

The sufficient statistics are usually simple functions of the data, $x$, and therefore explicit
formulae usually exist for the E step above. The expression to be maximized in the M step
has the functional form (w.r.t. $\theta$) of the log-likelihood of the complete data. Since maximizing
the  likelihood  of  the  complete  data  is  assumed  to  be  easy,  the  implementation  of  the  M  step 
above  is  easy  too. 
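
As an illustration of the two reduced steps, consider an example that is assumed here for demonstration and is not one of the thesis applications: $N$ exponential lifetimes with unknown rate $\lambda$, observed only up to a known censoring time. The complete data is the set of true lifetimes, with $a(\lambda) = \lambda^{-N}$, $\alpha(\lambda) = -\lambda$ and the single sufficient statistic $t(x) = \sum_i x_i$. The E step fills in the expected lifetime of each censored sample (censoring time plus $1/\lambda$, by the memoryless property), and the M step is the complete-data ML formula $\lambda = N / t$. Function names and the simulated data are hypothetical.

```python
import random

def em_censored_exponential(y, censored, rate0, n_iter=200):
    """EM for the rate of exponential lifetimes observed with right censoring.

    y[i] is the observed value min(x_i, c); censored[i] is True when x_i > c.
    """
    n = len(y)
    rate = rate0
    for _ in range(n_iter):
        # E step: conditional expectation of the sufficient statistic sum(x_i).
        # For a censored sample, E[x_i | x_i > y_i] = y_i + 1 / rate.
        t = sum(v + (1.0 / rate if c else 0.0) for v, c in zip(y, censored))
        # M step: complete-data ML estimate with t in place of sum(x_i).
        rate = n / t
    return rate

def direct_mle(y, censored):
    """Closed-form ML estimate for this model, used only as a cross-check."""
    uncensored = sum(1 for c in censored if not c)
    return uncensored / sum(y)

rng = random.Random(0)
cutoff = 0.6
lifetimes = [rng.expovariate(2.0) for _ in range(500)]
observed = [min(x, cutoff) for x in lifetimes]
is_censored = [x > cutoff for x in lifetimes]

rate_em = em_censored_exponential(observed, is_censored, rate0=1.0)
rate_ml = direct_mle(observed, is_censored)
print(rate_em, rate_ml)
```

In this model the fixed point of the iteration can be solved by hand and coincides with the closed-form estimate, so the comparison serves as a check that the E and M steps were written correctly.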

The Gaussian distribution belongs to the exponential family of distributions. In section 2.3
we will derive a closed form analytical expression for $Q(\theta; \theta')$, i.e. the E step, for the case
where $X$ and $Y$ are jointly Gaussian, related by a linear transformation. The maximization
problem in the M step, for this linear-Gaussian case, will be as complicated as solving a
maximum likelihood problem in the complete data model.


2.2  Convergence  results 

The EM (or the GEM) algorithm generates a sequence of parameters, $\{\theta^{(n)}\}$, and
an associated sequence of log-likelihoods, $\{L^{(n)}\}$, where $L^{(n)} = L(\theta^{(n)})$. We have shown
that each iteration increases the likelihood, i.e. the likelihood sequence is a monotonic
nondecreasing sequence ($L^{(n+1)} \ge L^{(n)}$). However, the EM theory should also answer the
following important questions:

•  Do  the  likelihood  and  the  parameter  sequences  converge? 

•  To  where  will  they  converge? 

•  How  fast  will  they  converge? 

These  convergence  issues  will  be  addressed  as  follows.  The  convergence  of  the  likelihood 
sequence  will  be  considered  first.  The  issue  of  its  convergence  to  a  global  maximum,  local 
maximum  or  a  stationary  point  will  be  discussed.  Then,  the  convergence  of  the  parameter 
estimate  sequence  will  be  considered,  noting  that  even  if  the  likelihood  sequence  converges 
(say to $L^*$), the associated parameter sequence may not converge, i.e. it may have a set of
limit points, each of which corresponds to this likelihood value $L^*$. For the cases in which
the  sequences  do  converge,  the  rate  of  convergence  in  the  neighborhood  of  the  convergence 
point  will  be  calculated. 

Our discussion in this section follows the considerations in Wu [9], and the original
paper of Dempster et al. [2]. Another important reference is [12]. The rate of convergence
and the computation of the Fisher information matrix associated with the EM parameter
estimate sequence are also discussed in [13], [14] and elsewhere.

The following notation and assumptions are used in this section. Let $\Theta$ be the set of
possible parameter values, which is assumed to be a subset of the k-dimensional Euclidean
space. $\Theta_0$ is the set

$$\Theta_0 = \{\theta \in \Theta \mid L(\theta) \ge L(\theta^{(0)})\}$$

and it is assumed to be compact for any $L(\theta^{(0)}) > -\infty$.

$\mathcal{M}$ will denote the set of local maxima of $L(\cdot)$, while $\mathcal{S}$ will denote the set of stationary
points of $L(\cdot)$, in the interior of $\Theta$.

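An EM (GEM) iteration may be denoted by

$$\theta^{(n+1)} \in M(\theta^{(n)})$$

where $M(\cdot)$ is a point to set mapping such that $M(\theta^{(n)})$ is the set of maximizers of $Q(\theta; \theta^{(n)})$
over $\theta \in \Theta$ for an EM algorithm, and such that

$$Q(\theta; \theta^{(n)}) \ge Q(\theta^{(n)}; \theta^{(n)}), \quad \forall \theta \in M(\theta^{(n)})$$

for a GEM algorithm.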

2.2.1  Convergence  of  the  likelihood  sequence 

As shown in theorem 2.1, the likelihood sequence, $\{L^{(n)}\}$, is a monotonic nondecreasing
sequence. Thus, if this sequence is also bounded, it converges to some value $L^*$. Only in
rare and singular cases can we find an unbounded likelihood sequence. Furthermore, if
the likelihood function $L(\cdot)$ is continuous in $\Theta$, the compactness of $\Theta_0$ guarantees that the
likelihood sequence, $\{L^{(n)}\}$, is bounded for any starting point $\theta^{(0)} \in \Theta$. Thus, the likelihood
sequence can be expected to converge in most cases to some $L^*$.

We want to know whether $L^*$ is a global maximum, a local maximum or at least a
stationary value of $L(\theta)$ over $\Theta$. Unfortunately, as for any general “hill climbing” optimization
algorithm, there is no guarantee that the EM algorithm will converge to a global or even
a local maximum. It has been reported, in [15] and [18], that if the log-likelihood, $L$, has
several  maxima  and  stationary  points,  then  convergence  of  the  EM  algorithm  to  either  type 
of  point  depends  on  the  choice  of  the  starting  point.  Note  that  this  phenomenon  occurs 
despite  the  fact  that  we  may  perform  a  global  maximization  (of  Q)  in  the  M  step. 
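
The dependence on the starting point can be seen in a small experiment. The sketch below uses an assumed toy model (an equal-weight mixture of two unit-variance Gaussians with unknown means), not one of the problems studied in this thesis. Started with both means equal, every sample is split evenly between the components, both means move to the sample mean, and the iteration stays there: this is a stationary point of the likelihood, but for well separated data it is not the maximum. An asymmetric start reaches a higher likelihood. All names and the simulated data are illustrative.

```python
import math
import random

def log_likelihood(y, mu1, mu2):
    """Observed-data log-likelihood of the equal-weight, unit-variance mixture."""
    c = 0.5 / math.sqrt(2.0 * math.pi)
    return sum(math.log(c * math.exp(-0.5 * (v - mu1) ** 2)
                        + c * math.exp(-0.5 * (v - mu2) ** 2)) for v in y)

def em_step(y, mu1, mu2):
    """One EM iteration with an exact (global) maximization of Q in the M step."""
    r = []
    for v in y:
        p1 = math.exp(-0.5 * (v - mu1) ** 2)
        p2 = math.exp(-0.5 * (v - mu2) ** 2)
        r.append(p1 / (p1 + p2))
    new1 = sum(ri * v for ri, v in zip(r, y)) / sum(r)
    new2 = sum((1.0 - ri) * v for ri, v in zip(r, y)) / sum(1.0 - ri for ri in r)
    return new1, new2

def run_em(y, mu1, mu2, n_iter=100):
    for _ in range(n_iter):
        mu1, mu2 = em_step(y, mu1, mu2)
    return mu1, mu2

rng = random.Random(1)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]

stuck = run_em(data, 0.0, 0.0)     # symmetric start: both means collapse to the sample mean
good = run_em(data, -1.0, 1.0)     # asymmetric start: the means separate

print(stuck, log_likelihood(data, *stuck))
print(good, log_likelihood(data, *good))
```

Both runs perform the same exact maximization of $Q$ at every iteration, so the difference in the limit comes only from the initial guess.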


In  Appendix  A  we  consider  the  convergence  issue  more  precisely,  where,  as  in  many 
numerical  analysis  algorithms,  the  convergence  analysis  is  based  on  the  Global  Convergence 
Theorem which may be found in [17] page 91, and [18] page 187. This theorem provides
sufficient conditions that guarantee the convergence of a general iterative procedure

$$\theta^{(n+1)} = M(\theta^{(n)})$$

to a solution set.

For the EM algorithm, where $M(\theta^{(n)})$ is the set of maximizers of $Q(\theta; \theta^{(n)})$, it is shown
in Appendix A that the simple condition

$$Q(\theta_1; \theta_2) \text{ is continuous in both } \theta_1 \text{ and } \theta_2 \tag{2.29}$$

in addition to the compactness of $\Theta_0$, guarantees the convergence to the solution set $\mathcal{S}$,
i.e. this condition implies that the likelihood sequence of the EM algorithm converges to a
stationary value.

A stronger sufficient condition is needed to guarantee convergence to a local maximum.
Again, in Appendix A, it is shown that, if in addition to the continuity condition (2.29) Q
satisfies

$\sup_{\theta' \in \Theta} Q(\theta'; \theta) > Q(\theta; \theta), \quad \forall \theta \in (S - \mathcal{M})$  (2.30)

where $(S - \mathcal{M})$ is the difference set $\{\theta \in S, \theta \notin \mathcal{M}\}$, then the likelihood sequence converges
to a local maximum, i.e. to the solution set $\mathcal{M}$.

Since  condition  (2.30)  is  hard  to  verify,  we  may  have  to  be  satisfied  with  a  proof  of 
convergence  to  a  stationary  point,  even  when  the  EM  algorithm  does  converge  to  a  local 
maximum.  Condition  (2.30)  is  not  met  in  general,  and  the  EM  algorithm  converges  to  a 
stationary  value,  local  maximum  or  global  maximum  depending  on  the  choice  of  starting 
point. 

2.2.2 Convergence of the parameter estimate sequence

The convergence of the likelihood sequence does not imply the convergence of the parameter
estimate sequence. Suppose that the likelihood sequence converges to $L^*$ and that
the conditions that guarantee the convergence to a stationary point are satisfied. Define

$S(L^*) = \{\theta \in S : L(\theta) = L^*\}$

The sequence of estimates, $\{\theta^{(n)}\}$, may not converge, i.e. it may have a (possibly infinite)
set of limit points. We may only say that all limit points of $\{\theta^{(n)}\}$ are in $S(L^*)$.

The convergence of the parameter sequence may be guaranteed (trivially), if the solution
set, i.e. $S(L^*)$ in the example above, has a single point. An important special case, in which
the solution set is a singleton, is when the likelihood function is unimodal in $\theta$.

The requirement that the solution set has a single point may be relaxed, and it may
be shown, see Appendix A, that if the solution set is discrete and

$\lim_{n\to\infty} \|\theta^{(n+1)} - \theta^{(n)}\| = 0$  (2.31)

the parameter estimate sequence will converge.

Condition (2.31) may be easily verified in many applications. For the EM estimate
sequence, since $L^{(n)} \to L^*$, and since

$L^{(n+1)} - L^{(n)} \ge Q(\theta^{(n+1)}; \theta^{(n)}) - Q(\theta^{(n)}; \theta^{(n)})$  (2.32)

a sufficient condition for (2.31) is that there exist a forcing function $\sigma(\cdot)$, such that

$Q(\theta^{(n+1)}; \theta^{(n)}) - Q(\theta^{(n)}; \theta^{(n)}) \ge \sigma(\|\theta^{(n+1)} - \theta^{(n)}\|), \quad \forall n$  (2.33)

where a forcing function is a function such that for any sequence, $\{x_n\}$,

$\lim_{n\to\infty} \sigma(x_n) = 0 \;\Rightarrow\; \lim_{n\to\infty} x_n = 0$  (2.34)

Taking $\sigma(x) = \lambda x^2$, $\lambda > 0$ as the forcing function, we get the sufficient condition

$Q(\theta^{(n+1)}; \theta^{(n)}) - Q(\theta^{(n)}; \theta^{(n)}) \ge \lambda \|\theta^{(n+1)} - \theta^{(n)}\|^2, \quad \forall n$  (2.35)

which may be verified easily in several applications.

One may argue that the convergence of the parameter sequence is not as important
as the convergence of the likelihood sequence to the desired location on the log-likelihood
surface. However, one should be aware of the possibility of a non-convergent estimate
sequence, e.g. if $L(\theta)$ has a ridge of stationary points on which $L(\theta) = L^*$, then the set $S(L^*)$
is not discrete and the EM algorithm may move indefinitely on that ridge.


2.2.3  Rate  of  convergence 


When  the  EM  (or  GEM)  algorithm  converges,  an  interesting  and  important  problem 
is  the  determination  of  its  rate  of  convergence.  In  this  section,  after  defining  the  rate 
of  convergence  and  other  terms,  that  are  commonly  used  in  association  with  it,  we  will 
calculate  the  rate  of  convergence  of  the  EM  algorithm. 

Let us denote the differentiation operator by D. A differentiation operator with respect to
two variables will be denoted $D^{ij}$, as

$D^{ij} Q(\theta_1; \theta_2) = \frac{\partial^{i+j} Q(\theta_1; \theta_2)}{\partial \theta_1^i \, \partial \theta_2^j}$

The following identities will be needed later when the rate of convergence of the EM algorithm
is explicitly developed,

$DL(\theta) = D^{10}Q(\theta; \theta)$  (2.36)

$D^2L(\theta) = D^{20}Q(\theta; \theta) - D^{20}H(\theta; \theta) = D^{20}Q(\theta; \theta) + D^{11}H(\theta; \theta)$  (2.37)

$D^{11}Q(\theta; \theta) = D^{11}H(\theta; \theta)$  (2.38)

These identities have already been recognised by Fisher [19], and are redeveloped in [2], [13]
and in Appendix A.

Definitions  and  background 

Consider a sequence, $\{x_n\}$, converging to a limit, $x^*$. Each element, $x_n$, belongs to X,
which is a subset of some normed space (say the k-dimensional Euclidean space). We define
the order of convergence as follows:


Definition 2.1 The order of convergence of a sequence, $\{x_n\}$, that converges to $x^*$, denoted
p, is the supremum of the nonnegative numbers, $p'$, for which the following ratio is finite,
i.e. for which

$\lim_{n\to\infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|^{p'}} < \infty$


Loosely speaking, the order of convergence describes the asymptotic behavior of the
error sequence, $\{e_n\}$, where $e_n = x_n - x^*$, i.e. as $n \to \infty$ we have

$\|e_{n+1}\| = a \|e_n\|^p$

So, the larger p is, the faster the sequence, $\{x_n\}$, converges.

Most iterative algorithms generate sequences whose order of convergence is unity. In
this case, the important number is the convergence rate, defined by:

Definition 2.2 The convergence rate of a sequence, $\{x_n\}$, that converges to $x^*$, denoted a,
is

$a = \lim_{n\to\infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|}$

where $0 \le a \le 1$. The sequence is said to converge linearly if $0 < a < 1$, and superlinearly
if $a = 0$.

The convergence rate of any sequence, whose order is greater than unity, will be zero.
These sequences have superlinear convergence. We note, however, that a sequence with
unity order of convergence may also have a superlinear convergence. Linear convergence
is sometimes referred to as geometric convergence or exponential convergence, since in this
case the error sequence, $\{e_n\}$, is a geometric sequence.
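
The distinction between linear and superlinear convergence can be checked numerically. The following is a minimal sketch, not part of the original development: the two scalar iterations (an affine recursion and Newton's iteration for the square root of 2) and the helper name `error_ratios` are assumptions chosen only for illustration.

```python
import math

def error_ratios(xs, limit):
    """Return the successive ratios |e_{n+1}| / |e_n| of a scalar sequence xs."""
    errs = [abs(x - limit) for x in xs]
    return [e1 / e0 for e0, e1 in zip(errs, errs[1:]) if e0 > 0.0]

# Linear convergence: x_{n+1} = 0.5 * x_n + 1 has the limit x* = 2.
linear = [0.0]
for _ in range(20):
    linear.append(0.5 * linear[-1] + 1.0)
linear_ratios = error_ratios(linear, 2.0)

# Superlinear convergence: Newton's iteration for sqrt(2).
newton = [1.5]
for _ in range(3):
    newton.append(0.5 * (newton[-1] + 2.0 / newton[-1]))
newton_ratios = error_ratios(newton, math.sqrt(2.0))

print(linear_ratios[-1])  # the ratio settles at the convergence rate of the affine map
print(newton_ratios)      # the ratios keep shrinking, i.e. the rate is zero
```

For the affine recursion the ratio stays at the slope of the map, while for Newton's iteration the ratios decrease from one step to the next, which is the behavior described by a rate of zero.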

In many iterative algorithms, the iteration is defined via a mapping that successively
approximates the solution, i.e.

$x_{n+1} = M(x_n)$


In this case, we may find the rate of convergence by investigating the Jacobian matrix (or
the matrix of derivatives) of this mapping, defined by,

$[DM(x)]_{ij} = \frac{\partial [M(x)]_i}{\partial [x]_j}$  (2.39)

where $[\cdot]_i$ denotes the ith component of a vector.
Since

$\lim_{n\to\infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|} = \lim_{n\to\infty} \frac{\|M(x_n) - M(x^*)\|}{\|x_n - x^*\|}$  (2.40)

the largest eigenvalue (in magnitude) of the matrix $DM(x^*)$ will provide us with the convergence rate of
the iterative algorithm.
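
As an illustration of this statement, the sketch below compares the observed error ratio of an iteration with the largest eigenvalue magnitude of a finite-difference Jacobian at the fixed point. The two-dimensional linear mapping is a hypothetical example, and the helper names (`mapping`, `jacobian`, `eigenvalues_2x2`) are assumptions introduced only for this sketch.

```python
import math

def mapping(x):
    """Hypothetical iteration x_{n+1} = M(x_n) with fixed point x* = (0, 0)."""
    return [0.5 * x[0] + 0.1 * x[1], 0.2 * x[1]]

def jacobian(f, x, h=1e-6):
    """Forward-difference approximation of [DM(x)]_ij = d[M(x)]_i / d[x]_j."""
    fx = f(x)
    cols = []
    for j in range(len(x)):
        xp = list(x)
        xp[j] += h
        fxp = f(xp)
        cols.append([(fxp[i] - fx[i]) / h for i in range(len(fx))])
    # cols[j][i] holds entry (i, j); transpose into row-major form.
    return [[cols[j][i] for j in range(len(x))] for i in range(len(fx))]

def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix with real eigenvalues, from trace and determinant."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return [(tr + root) / 2.0, (tr - root) / 2.0]

# Rate predicted from the Jacobian at the fixed point.
predicted_rate = max(abs(v) for v in eigenvalues_2x2(jacobian(mapping, [0.0, 0.0])))

# Rate observed from the error norms of the iteration itself.
x = [1.0, 1.0]
norms = []
for _ in range(40):
    norms.append(math.hypot(x[0], x[1]))
    x = mapping(x)
observed_rate = norms[-1] / norms[-2]

print(predicted_rate, observed_rate)
```

The closed-form eigenvalue routine is used only to keep the sketch free of external libraries; for larger problems a numerical eigenvalue solver would take its place.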


Rate  of  convergence  of  the  EM  algorithm 

The rate of convergence of the EM algorithm can be calculated by deriving the Jacobian
matrix of the mapping, $\theta^{(n+1)} = M(\theta^{(n)})$, associated with the EM algorithm. We recall
that this mapping is defined by

$\theta^{(n+1)} = M(\theta^{(n)}) = \arg\max_{\theta} Q(\theta; \theta^{(n)})$  (2.41)

Using the fact that $D^{10}Q(\theta^{(n+1)}; \theta^{(n)}) = 0$, this Jacobian matrix can be easily calculated as
follows. Since the vector $D^{10}Q(\theta^{(n+1)}; \theta^{(n)}) = D^{10}Q(M(\theta^{(n)}); \theta^{(n)}) = 0$, its derivative with
respect to the vector $\theta^{(n)}$ is the zero matrix, i.e.

$0 = \frac{d}{d\theta^{(n)}} D^{10}Q(M(\theta^{(n)}); \theta^{(n)}) = DM(\theta^{(n)}) \cdot D^{20}Q(M(\theta^{(n)}); \theta^{(n)}) + D^{11}Q(M(\theta^{(n)}); \theta^{(n)})$  (2.42)

Let $\theta^*$ denote the limit of the estimate sequence. Since $\theta^{(n+1)} = M(\theta^{(n)})$ and $\theta^* = M(\theta^*)$,
in the limit as $\theta^{(n)} \to \theta^*$, equation (2.42) becomes

$0 = DM(\theta^*) D^{20}Q(\theta^*; \theta^*) + D^{11}Q(\theta^*; \theta^*)$  (2.43)

Using (2.38) and then (2.37) will give us the Jacobian matrix,

$DM(\theta^*) = D^{20}H(\theta^*; \theta^*) [D^{20}Q(\theta^*; \theta^*)]^{-1}$  (2.44)

This result appears in theorem 4 of [2], which is repeated in Appendix A.

The rate of convergence of the EM algorithm (2.44) has the following interesting interpretation.
The term $D^{20}H(\theta^*; \theta^*)$ is (up to sign) the Fisher Information matrix $I_{X/Y}$ of X given Y
about $\theta^*$, i.e. for exponential families it is the variance of the sufficient statistics $t(x)$ given
$y$ and $\theta^*$. The term $D^{20}Q(\theta^*; \theta^*)$ is, for regular exponential families and up to the same sign,
the Fisher Information matrix $I_X$ of X about $\theta^*$, i.e. it is the variance of the sufficient
statistics $t(x)$ in the X model without any measurements. The common sign cancels in (2.44).
From (2.37), the Fisher information $I_Y$ of Y about $\theta^*$,
for regular exponential families, is given by,

$I_Y = I_X - I_{X/Y}$  (2.45)

Thus, in the scalar case, the rate of convergence is given by,

$a = \frac{I_{X/Y}}{I_X} = 1 - \frac{I_Y}{I_X}$  (2.46)

If the complete data is such that it can be predicted well given the observations, i.e.
$I_{X/Y}$ is small, then a is small and the EM algorithm converges rapidly. On the other hand,
if we choose the complete data to be much larger than the observations, then the complete
data will carry much more (Fisher) information than the observations; $I_Y / I_X$ will be close
to zero, a will be close to unity and the EM algorithm will converge slowly. Indeed, if the
complete data is identical to the observations, the EM algorithm converges in one step;
however, this step is as complicated as solving the original ML problem in the Y model.
On the other hand, choosing a complete data that is much larger than the observations, in
order to get simple EM steps, will require performing more iterations because the algorithm
converges more slowly.
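
The trade-off expressed by (2.46) can be seen on a small example. The sketch below uses a hypothetical scalar model that is not taken from the text: each observation is y = x1 + x2 with x1 ~ N(theta, beta) and x2 ~ N(0, 1 - beta), so that the complete data (x1, x2) carries Fisher information N / beta while the observations carry N. Under these assumptions (2.46) predicts the rate 1 - beta, and beta = 1 corresponds to complete data that coincides with the observations.

```python
def em_iteration(theta, ys, beta):
    """One EM iteration for the hypothetical model y = x1 + x2 described above."""
    # E step: conditional mean of x1 given y at the current estimate.
    x1_hat = [theta + beta * (y - theta) for y in ys]
    # M step: the complete-data ML estimate of theta is the mean of x1.
    return sum(x1_hat) / len(x1_hat)

def observed_rate(ys, beta, n_iter=10, theta0=0.0):
    """Ratio of successive errors with respect to the ML estimate in the Y model."""
    ml = sum(ys) / len(ys)
    thetas = [theta0]
    for _ in range(n_iter):
        thetas.append(em_iteration(thetas[-1], ys, beta))
    errs = [abs(t - ml) for t in thetas]
    return errs[-1] / errs[-2]

ys = [1.0, 2.0, 4.0, 5.0]
beta = 0.25
fisher_x = len(ys) / beta          # information of the complete data
fisher_y = float(len(ys))          # information of the observations
predicted = 1.0 - fisher_y / fisher_x
rate = observed_rate(ys, beta)
one_step = em_iteration(0.0, ys, 1.0)  # complete data identical to the observations

print(predicted, rate, one_step)
```

A small beta makes the complete data much more informative than the observations and the iteration correspondingly slow; with beta = 1 a single iteration already returns the sample mean.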

2.3  The  Linear  Gaussian  case 

This section has two objectives. The first objective is to provide an explicit example
of the application of the general EM theory developed above. The second and more important
goal is to develop results that are referred to later in the thesis in a wide range of
applications.

Suppose that the complete data, X, and the observed (incomplete) data, Y, are related
by the linear transformation

$Y = HX$  (2.47)

where H is a non-invertible matrix. The complete data X possesses the following multivariate
Gaussian probability density:

$f_X(x; \theta) = \left[\det\left(\tfrac{2\pi}{\lambda}\Lambda(\theta)\right)\right]^{-\lambda/2} \exp\left\{-\tfrac{\lambda}{2}(x - m(\theta))^\dagger \Lambda^{-1}(\theta)(x - m(\theta))\right\}$  (2.48)

where $\lambda = 1$ if X is real valued, $\lambda = 2$ if X is complex valued, and $\dagger$ denotes the conjugate
transpose operation. The observations also possess a Gaussian distribution, and thus
the likelihood is given by

$f_Y(y; \theta) = \left[\det\left(\tfrac{2\pi}{\lambda}\Lambda_y(\theta)\right)\right]^{-\lambda/2} \exp\left\{-\tfrac{\lambda}{2}(y - m_y(\theta))^\dagger \Lambda_y^{-1}(\theta)(y - m_y(\theta))\right\}$  (2.49)

where $m_y$ and $\Lambda_y$ are respectively the mean and the covariance of the observations, given
by

$m_y(\theta) = H \, m(\theta)$  (2.50)

$\Lambda_y(\theta) = H \Lambda(\theta) H^\dagger$  (2.51)

We  note  that  our  parametric  model  is  such  that  the  parameters  define  the  mean  and  the 
covariance  of  a  Gaussian  density  in  a  possibly  non-linear  way.  Thus,  maximizing  the  like¬ 
lihood  in  this  linear  Gaussian  case  may  require  solving  a  non-linear  optimization  problem. 
Nevertheless,  we  will  be  able  to  invoke  results  from  linear  estimation  theory  and  explicitly 
derive  the  EM  algorithm  for  this  case. 

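We start developing the EM algorithm for maximizing the likelihood of the observation,
Y, using the complete data, X, by examining the log-likelihood of X. By taking the
logarithm of (2.48), we get,

$\log f_X(x; \theta) = C - \tfrac{\lambda}{2}\log\det\left(\tfrac{2\pi}{\lambda}\Lambda(\theta)\right) - \tfrac{\lambda}{2}(x - m(\theta))^\dagger \Lambda^{-1}(\theta)(x - m(\theta))$

$= C - \tfrac{\lambda}{2}\log\det\left(\tfrac{2\pi}{\lambda}\Lambda(\theta)\right) - \tfrac{\lambda}{2} m^\dagger(\theta)\Lambda^{-1}(\theta)m(\theta)$

$\quad + \tfrac{\lambda}{2} x^\dagger \Lambda^{-1}(\theta) m(\theta) + \tfrac{\lambda}{2} m^\dagger(\theta)\Lambda^{-1}(\theta) x - \tfrac{\lambda}{2}\mathrm{tr}\left(\Lambda^{-1}(\theta)\, x x^\dagger\right)$  (2.52)

where C is a constant independent of $\theta$ and tr( ) denotes the trace of a matrix. Maximizing
this expression with respect to $\theta$ is assumed to be easy.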

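Taking the conditional expectation of (2.52), given $Y = y$, at a parameter value $\theta^{(n)}$,
we get,

$Q(\theta; \theta^{(n)}) = E\{\log f_X(x; \theta) \mid y; \theta^{(n)}\}$

$= C - \tfrac{\lambda}{2}\log\det\left(\tfrac{2\pi}{\lambda}\Lambda(\theta)\right) - \tfrac{\lambda}{2} m^\dagger(\theta)\Lambda^{-1}(\theta)m(\theta)$

$\quad + \tfrac{\lambda}{2} (\hat{x}^{(n)})^\dagger \Lambda^{-1}(\theta) m(\theta) + \tfrac{\lambda}{2} m^\dagger(\theta)\Lambda^{-1}(\theta)\hat{x}^{(n)} - \tfrac{\lambda}{2}\mathrm{tr}\left(\Lambda^{-1}(\theta)\Psi^{(n)}\right)$  (2.53)

where $\hat{x}^{(n)} = E\{x \mid Y = y; \theta^{(n)}\}$ and $\Psi^{(n)} = E\{x x^\dagger \mid Y = y; \theta^{(n)}\}$.

Maximizing expression (2.53) with respect to $\theta$ must be easy, since it has the same
functional form, with respect to $\theta$, as (2.52).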

Since X and Y are jointly Gaussian, and related by a linear transformation, the conditional
expectations required for (2.53) can be computed by straightforward modifications
of known results from linear estimation theory. We obtain,

$\hat{x}^{(n)} = m(\theta^{(n)}) + \Gamma(\theta^{(n)}) \left[y - H\, m(\theta^{(n)})\right]$  (2.54)

$\Psi^{(n)} = \left[I - \Gamma(\theta^{(n)}) H\right] \Lambda(\theta^{(n)}) + \hat{x}^{(n)} (\hat{x}^{(n)})^\dagger$  (2.55)

where I is the identity matrix and $\Gamma(\theta)$ is the "Kalman gain" defined by

$\Gamma(\theta) = \Lambda(\theta) H^\dagger \left[H \Lambda(\theta) H^\dagger\right]^{-1}$  (2.56)

Note that if we set $\theta^{(n)} = \theta$, equations (2.54) and (2.55) are the well known formulae for
the conditional expectation in the Gaussian case, e.g. [20].

The E and M steps of the EM algorithm for the linear Gaussian case may now be stated
explicitly as follows. Having a current estimate, $\theta^{(n)}$, the algorithm iterates between,

•  The E step

Calculate $\hat{x}^{(n)}$ and $\Psi^{(n)}$ by (2.54) and (2.55). Note that the sufficient statistics of the
Gaussian distribution are composed of linear and quadratic functions of the data.

•  The M step

Update $\theta$ by maximizing the expression in (2.53). The explicit solution is some functional
of the statistics calculated in the E step.
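
The E step of (2.54) to (2.56) can be sketched in a few lines. The code below is an illustrative special case rather than the general algorithm: it assumes a real-valued model and a single scalar observation y = h^T x, so that the matrix inverse in (2.56) reduces to a division. The function name `e_step` and its argument names are assumptions made for this sketch.

```python
def e_step(y, h, mean, cov):
    """E step for a real-valued model with one scalar observation y = h^T x.

    Returns the conditional mean of x and the conditional second moment E{x x^T | y},
    evaluated with the current mean vector `mean` and covariance matrix `cov`.
    """
    k = len(mean)
    cov_h = [sum(cov[i][j] * h[j] for j in range(k)) for i in range(k)]  # Lambda h
    var_y = sum(h[i] * cov_h[i] for i in range(k))                       # h^T Lambda h
    gain = [c / var_y for c in cov_h]                                    # gain of (2.56)
    innovation = y - sum(h[i] * mean[i] for i in range(k))               # y - h^T m
    x_hat = [mean[i] + gain[i] * innovation for i in range(k)]
    # Conditional covariance (I - gain h^T) Lambda; Lambda is symmetric,
    # so the row vector h^T Lambda equals (Lambda h)^T.
    cond_cov = [[cov[i][j] - gain[i] * cov_h[j] for j in range(k)] for i in range(k)]
    second_moment = [[cond_cov[i][j] + x_hat[i] * x_hat[j] for j in range(k)]
                     for i in range(k)]
    return x_hat, second_moment

# Two unit-variance, zero-mean components observed only through their sum.
x_hat, second_moment = e_step(2.0, [1.0, 1.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(x_hat, second_moment)
```

In this example the observed sum is split equally between the two components, and the conditional second moment reflects the negative correlation that the constraint y = x1 + x2 induces between them. The M step is not shown because its form depends on how the parameters enter the mean and the covariance.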

2.4 The EM algorithm with varying complete data


In this section, we present a variation of the EM algorithm, where the complete data
may vary from iteration to iteration. This variation is referred to as the Extended EM
(EEM) algorithm.

As mentioned above, the choice of complete data is the critical factor in designing an
EM algorithm for a given problem. This choice determines the complexity of the algorithm
and its convergence rate; it may also affect the convergence point, leading to a different
stationary point for a different choice of complete data. An alternative to choosing a fixed
complete data is to let the complete data vary from iteration to iteration. The choice of
complete data may vary according to a fixed rule or may depend on the current value of
the estimate. By allowing the complete data to vary, we can achieve the following useful
properties:

•  Additional  iterative  algorithms  are  incorporated  in  the  EM  framework. 

•  Simpler  algorithms  may  emerge. 

•  The  algorithm  may  converge  faster. 

•  Varying the complete data may enable the algorithm to escape from unwanted
stationary points.

We  start  by  presenting  the  algorithm  formally,  and  giving  its  properties.  Then,  we  will 
motivate  the  EEM  algorithm  and  suggest  strategies  for  varying  the  complete  data. 

2.4.1 General theory

Suppose we observe $y \in Y$, where Y denotes the sample space of the observations, and
the probability of $y$ is $f_Y(y; \theta)$, indexed by $\theta \in \Theta$. The observed sequence may be viewed as
being incomplete with respect to a family of complete data $X_\beta$, indexed by $\beta \in B$, where B
is an arbitrary index set. Each $X_\beta$ is a sample space with an associated p.d.f., $f_{X_\beta}(x_\beta; \theta)$,
also indexed by $\theta \in \Theta$. For any $\beta$, a sample of the complete data, $x_\beta$, is related to the
observations by,

$y = T_\beta(x_\beta)$  (2.57)

where $y$ denotes the observations and $T_\beta$ is a non-invertible transformation.

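In complete analogy to (2.8), we may write for all $\beta$,

$\log f_Y(y; \theta) = E\{\log f_{X_\beta}(x_\beta; \theta) \mid y; \theta'\} - E\{\log f_{X_\beta/Y}(x_\beta / y; \theta) \mid y; \theta'\}$  (2.58)

or, using the notation in section 2.1,

$L(\theta) = Q_\beta(\theta; \theta') - H_\beta(\theta; \theta')$  (2.59)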

Using  this  relation  and  invoking  Jensen’s  inequality  we  may  prove  the  following  lemma. 

Lemma 2.1 For a given parameter value $\theta_1$, if for any $\beta$, another value $\theta_2$ satisfies

$Q_\beta(\theta_2; \theta_1) > Q_\beta(\theta_1; \theta_1)$

then,

$L(\theta_2) > L(\theta_1)$

Note that this lemma states that we have a procedure for strictly increasing the likelihood,
if we can find any complete data, for which the function $Q_\beta$ may be strictly increased.


The Extended EM algorithm is now presented formally. We note that the choice of
complete data in each iteration may depend on the current and previous estimates and on
the iteration index.

•  Start, n = 0: guess $\theta^{(0)}$

•  Until some convergence criterion is met,

-  Choose a complete data $X_{\beta^{(n)}}$, where,

$\beta^{(n)} = f(n, \theta^{(0)}, \ldots, \theta^{(n)})$  (2.60)

-  The E step: calculate

$Q_{\beta^{(n)}}(\theta; \theta^{(n)}) = E\{\log f_{X_{\beta^{(n)}}}(x_{\beta^{(n)}}; \theta) \mid y; \theta^{(n)}\}$  (2.61)

-  The M step: solve

$\theta^{(n+1)} = \arg\max_{\theta \in \Theta} Q_{\beta^{(n)}}(\theta; \theta^{(n)})$  (2.62)

-  n = n + 1

The proposed EEM algorithm preserves the basic monotonicity property of the EM algorithm.
Since the convergence properties of the EM algorithm were proved using the Global
Convergence theorem, they will follow through to the EEM algorithm, if the conditions
developed in section 2.2 hold for every $\beta \in B$. These properties of the EEM algorithm hold
regardless of the rule f we use for changing the complete data. A carefully designed rule
may provide an algorithm with better properties, however. Such rules will be suggested in
the following section.
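
The structure of the EEM iteration can be sketched as a small driver in which the rule for choosing the complete data is passed in as a function. Everything specific in the sketch is an assumption made for illustration: the names `eem`, `choose_beta` and `em_step`, and the toy model in which each observation is y = x1 + x2 with x1 ~ N(theta, beta) and x2 ~ N(0, 1 - beta), so that the index beta selects the complete data.

```python
def eem(theta0, choose_beta, em_step, n_iter):
    """Run n_iter EEM iterations and return all estimates.

    choose_beta(n, history) plays the role of the rule f in (2.60);
    em_step(theta, beta) performs the E and M steps for the complete data indexed by beta.
    """
    history = [theta0]
    for n in range(n_iter):
        beta = choose_beta(n, history)
        history.append(em_step(history[-1], beta))
    return history

ys = [1.0, 2.0, 4.0, 5.0]

def em_step(theta, beta):
    # E step: conditional mean of x1 given y; M step: average of these means.
    x1_hat = [theta + beta * (y - theta) for y in ys]
    return sum(x1_hat) / len(x1_hat)

def log_likelihood(theta):
    # Log-likelihood of the observations, y ~ N(theta, 1), up to a constant.
    return -0.5 * sum((y - theta) ** 2 for y in ys)

def schedule(n, history):
    # A fixed rule: one complete-data choice for the first iterations, another afterwards.
    return 0.2 if n < 3 else 0.9

estimates = eem(0.0, schedule, em_step, 12)
likelihoods = [log_likelihood(t) for t in estimates]
print(estimates[-1], likelihoods[-1])
```

Whatever schedule is supplied, the likelihood values along the run do not decrease in this example, which is the monotonicity property discussed above.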




2.4.2  Motivation  and  rules  for  changing  the  complete  data 

The  following  simple  situation  may  motivate  the  usage  of  the  EEM  algorithm.  Suppose 
that,  in  a  specific  problem,  two  complete  data  definitions  may  be  considered.  Each  choice 
of  complete  data  generates  a  different  algorithm  for  maximizing  the  likelihood  of  the  ob¬ 
servations  with  different  convergence  properties.  Following  the  EEM  idea,  we  may  switch 
between  these  algorithms.  If  one  specification  of  complete  data  generates  a  simpler  but 
slower algorithm, we will start using the simpler algorithm and then, near the convergence
point, switch to the other algorithm to converge to the solution faster. If one algorithm
converges  to  an  unwanted  stationary  point  of  the  likelihood,  which  is  not  a  fixed  point  of 
the  other  algorithm,  we  will  switch  to  the  other  algorithm  to  avoid  it. 

In general, the family of complete data specifications is indexed by $\beta \in B$, where B
is an arbitrary index set, say, a subset of the k-dimensional Euclidean space. The choice
of complete data may depend on the current estimate of the parameters. In the general
case, the following strategies may be used to obtain algorithms with better convergence
properties.

Accelerating  the  convergence  rate 

Loosely speaking, the rate of convergence is faster, when the complete data can be better
predicted from the observations. Thus, changing the complete data in each iteration, depending
on the current parameter model, $\theta^{(n)}$, in such a way that it may be better predicted
from the observations, will improve the rate of convergence.

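More specifically, an EEM iteration is given by $\theta^{(n+1)} = M_\beta(\theta^{(n)})$, where $\theta^{(n+1)}$ maximizes
$Q_\beta(\cdot\,; \theta^{(n)})$. Since the vector $D^{10}Q_\beta(\theta^{(n+1)}; \theta^{(n)}) = D^{10}Q_\beta(M_\beta(\theta^{(n)}); \theta^{(n)}) = 0$, its derivative with
respect to the vector $\theta^{(n)}$ is the zero matrix, i.e.

$0 = \frac{d}{d\theta^{(n)}} D^{10}Q_\beta(M_\beta(\theta^{(n)}); \theta^{(n)}) = DM_\beta(\theta^{(n)}) \cdot D^{20}Q_\beta(M_\beta(\theta^{(n)}); \theta^{(n)}) + D^{11}Q_\beta(M_\beta(\theta^{(n)}); \theta^{(n)})$  (2.63)

Now, in the limit as $n \to \infty$, $\theta^{(n+1)}$ and $\theta^{(n)}$ both approach $\theta^*$, and we get

$DM_\beta(\theta^{(n)}) = -D^{11}Q_\beta(\theta^{(n+1)}; \theta^{(n)}) \left[D^{20}Q_\beta(\theta^{(n+1)}; \theta^{(n)})\right]^{-1} = -D^{11}Q_\beta(\theta^{(n)}; \theta^{(n)}) \left[D^{20}Q_\beta(\theta^{(n)}; \theta^{(n)})\right]^{-1}$  (2.64)

The largest eigenvalue of $DM_\beta(\theta^{(n)})$ will define the rate of convergence. To accelerate
the convergence, we want to choose a complete data (i.e. $\beta$), that will minimize this largest
eigenvalue.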

Depending on the set B, it may be possible to find $\beta \in B$, in terms of $\theta^{(n)}$, that solves
the following equation

$D^{11}Q_\beta(\theta^{(n)}; \theta^{(n)}) = f(\beta, \theta^{(n)}) = 0$  (2.65)

In this case, the convergence rate of the EEM algorithm will be superlinear.

Avoiding  unwanted  convergence  points 

We wish to find a global maximizer of the likelihood function. However, under the conditions
of section 2.2, an EM algorithm with a fixed complete data specification is only guaranteed
to converge to a stationary point of the likelihood. Nevertheless, not every stationary
point of the likelihood is a fixed point of an EM algorithm. If a family of complete data
is given and a specific stationary point is not a fixed point of all the EM iterations that
correspond to the members of this family, then following lemma 2.1, we may find a complete
data specification that will take us away from this unwanted stationary point. Once we
have avoided a stationary point and we find a parameter value for which the likelihood is
higher, then due to the monotonicity property, we will never return to this stationary point.

When the given set of possible complete data is indexed by $\beta$, the following rules for
choosing $\beta$ in each iteration, in order to avoid unwanted convergence points, are suggested:

•  Choose a random $\beta$ in its domain.

•  Search for the $\beta$ that gives the largest increase in the likelihood. If searching the entire
domain of $\beta$ is complicated, search in a sub-domain, which may be picked randomly.
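
The second rule can be sketched as a selection function that evaluates one EM step per candidate index within a randomly picked sub-domain. The function name `choose_beta_greedy`, the candidate grid and the toy model (y = x1 + x2 with x1 ~ N(theta, beta), x2 ~ N(0, 1 - beta)) are assumptions made for illustration only; the toy likelihood has a single maximum, so the sketch shows the mechanics of the rule rather than an escape from an unwanted stationary point.

```python
import random

def choose_beta_greedy(theta, candidates, em_step, log_likelihood, subset_size, rng):
    """Pick, from a random sub-domain, the index whose EM step increases the likelihood most."""
    subset = rng.sample(candidates, min(subset_size, len(candidates)))
    best = max(subset, key=lambda b: log_likelihood(em_step(theta, b)))
    return best, subset

ys = [1.0, 2.0, 4.0, 5.0]

def em_step(theta, beta):
    # E step: conditional mean of x1 given y; M step: average of these means.
    x1_hat = [theta + beta * (y - theta) for y in ys]
    return sum(x1_hat) / len(x1_hat)

def log_likelihood(theta):
    # Log-likelihood of the observations, y ~ N(theta, 1), up to a constant.
    return -0.5 * sum((y - theta) ** 2 for y in ys)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
candidates = [0.1 * k for k in range(1, 11)]
theta = 0.0
best, subset = choose_beta_greedy(theta, candidates, em_step, log_likelihood, 4, rng)
print(best, sorted(subset))
```

Each candidate costs one full EM step, so the size of the sub-domain trades the quality of the choice against the work per iteration.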

One  may  argue  that  these  rules  are  heuristic  and  ad-hoc.  However,  the  whole  area  of 
global  optimization  of  non-convex  functions  is  heuristic,  and  depends  on  the  specific  goal 
function.  Our  approach  potentially  provides  an  improvement,  within  the  framework  of  the 
EM  algorithm,  in  the  sense  that,  even  when  it  fails  to  find  the  global  maximizer,  it  finds  a 
better  local  maximizer. 

2.5  The  EM  algorithm  for  general  estimation  criteria 

The  EM  idea  may  be  applied  to  general  inference  problems,  other  than  parameter 
estimation  problems,  and  a  variety  of  estimation  methods,  other  than  the  ML  method. 
We  will  start  by  suggesting  a  formal  structure  of  the  EM  algorithm  for  general  estimation 
methods.  Then,  using  the  general  Minimum  Information  criterion,  we  will  show  that  a  wide 
class  of  estimation  methods  reduce  to  optimizing  a  criterion  composed  of  the  log-likelihood 
and  an  additive  penalty  term.  An  EM  method  for  optimizing  these  criteria,  analogous  to 
the  EM  method  for  ML,  will  be  suggested. 


2.5.1  Formal  structure 


As  before,  let  Y  be  the  sample  space  of  the  observations,  and  X  the  sample  space  of  the 
complete  data. 

Suppose we observe $x \in X$. We want to find a model or a structure, $\pi$, that will "explain"
$x$. Since we consider statistical inference methods, a model will define a probability
distribution or a p.d.f., $f_X(x; \pi)$, over the set X. The model may be as simple as a parameter
specification or as complicated as a full, unconstrained description of the underlying
probability measure.

The a-priori knowledge, the model complexity and a cost function for measuring goodness-of-fit
of the model to the observation, will determine the procedure for estimating this model.
There are many ways to incorporate knowledge, complexity and goodness-of-fit measures,
which explains the variety of criteria for statistical inference. However, in any inference
procedure we may find the following two characteristics:

•  Extraction of sufficient statistics: Not all the observed data is relevant to the model
estimation goal. Extracting only $T(x)$ from the data, where $T(\cdot)$ is a many-to-one
function, is sufficient.

•  Optimization: The possible models are compared and a model estimate $\hat{\pi}$ is generated
by a procedure $\mathcal{P}\{T(x)\} = \hat{\pi}$, which is usually a result of solving an optimization
problem:

$\hat{\pi} = \arg\min_{\pi} F(T(x); \pi)$

We assume that given $x \in X$, i.e. given the complete data, we have a satisfactory solution
for the model estimation problem. In other words, there exists a way to incorporate the
a-priori knowledge, to measure the complexity and goodness-of-fit of a candidate model,
to calculate the required statistics and to solve any optimization problem that is implied
by the above. A satisfactory solution consists of a set of formulas, a detailed algorithm
or a program. Thus, we may imagine the existence of a "black box", whose input is a
measurement $x \in X$ and whose output is a model estimate, $\hat{\pi}$.

Suppose now that we observe the incomplete data, $y \in Y$, where $y = \mathcal{T}(x)$ and $\mathcal{T}(\cdot)$
is a non-invertible (many-to-one) transformation. Any candidate model will define a p.d.f.,
$f_Y(y; \pi)$, over the set Y, where

$f_Y(y; \pi) = \int_{X(y)} f_X(x; \pi)\, dx$  (2.66)

and the set $X(y)$ is given by (2.4).

We assume that we do not have a satisfactory direct way to determine the model or the
structure, given the incomplete observations, either because we cannot specify the procedure
for determining the model, or, when the procedure is specified, simply because implementing
that procedure (e.g. solving the implied optimization problem) is difficult.

The EM algorithm, which we now formally present, is a possible method for determining
the model, given the incomplete observations, by making essential use of the availability
of a satisfactory estimation procedure for complete data observations.

•  Start, n = 0, initial model $\pi^{(0)}$

•  Iterate (until some convergence criterion is met)

-  The E step: calculate

$T^{(n)} = E\{T(x) \mid Y = y; \pi^{(n)}\} = \int_{X(y)} T(x)\, f_{X/Y}(x / y; \pi^{(n)})\, dx$  (2.67)

-  The M step: solve

$\pi^{(n+1)} = \mathcal{P}\{T^{(n)}\} = \arg\min_{\pi} F(T^{(n)}; \pi)$  (2.68)

2.5.2  The  EM  algorithm  for  the  Minimum  Information  criterion 

Minimum Information (MI) is a general method for solving inference problems, suggested
originally by Solomonoff [21] and recently by Hart [22]. This method generalizes the ML
and MAP methods. It may be applied to situations where a general structure
or model $\pi$ should be estimated. It also makes it possible to incorporate more general
a-priori information.

Given data, $y \in Y$, the MI method estimates the model $\pi$ by,

$\hat{\pi} = \arg\min_{\pi} I(y, \pi)$  (2.69)

where $I(\cdot)$ denotes the (self) information. The joint information $I(y, \pi)$ may be written as,

$I(y, \pi) = I(y / \pi) + I(\pi)$  (2.70)

where $I(y / \pi)$ is the conditional (self) information.

The MI criterion implies many estimation procedures, since there are many notions and
definitions of information. We will usually use the more quantitative ones:

•  Combinatorial information, due to Hartley [23].

•  Probabilistic (Shannon) information, due to Shannon [24] and Wiener [25].

•  Algorithmic (Kolmogorov) information, due to Solomonoff [21], Kolmogorov [26] and
Chaitin [27].

These three notions of information are summarized in [26].

Shannon information is the most adequate for $I(y / \pi)$, since a given model provides a
probabilistic description of the observations. Thus, $I(y / \pi) = -\log f_Y(y; \pi)$, and the MI
criterion reduces to the minimization of a likelihood-like criterion,

$G(\pi) = -\log f_Y(y; \pi) + I(\pi)$  (2.71)

Shannon information may be adequate to describe the information given by specifying
a model $\pi$, if we know that all possible models belong to a well defined set $\Pi$, and an a-priori
p.d.f., $f_\Pi(\pi)$, defined over $\Pi$, is given. In this case $I(\pi) = -\log f_\Pi(\pi)$ and the MI criterion
reduces to the MAP criterion, i.e. we estimate the model by,

$\hat{\pi} = \arg\max_{\pi} \left[\log f_Y(y; \pi) + \log f_\Pi(\pi)\right]$  (2.72)

Other examples will involve the algorithmic notion of information, which measures the
information of observed data by the number of bits needed to describe it. As shown, e.g.
in [28], the algorithmic (Kolmogorov) information cannot be computed when no constraints
on the "language" used to describe the data are specified. However, given a constrained
framework, this information may be specified quantitatively.

A special case, that uses the algorithmic information, in a constrained framework, as a
criterion to weight a given model, is known as the Minimum Description Length (MDL).
This criterion was suggested by Rissanen [29,30,31]. The description length needed to
describe the model was given explicitly by Rissanen for the problem of determining the
parameters $\theta_1, \ldots, \theta_n$ together with their number n. The conditional information of the
data, given the model, is interpreted as the code length needed to describe the observation,
given the model. This term is, as above, the negative log-likelihood, or the Shannon information, of
the data, given the model. Thus, in this case the MDL criterion requires solving,

$\hat{n}, \hat{\theta} = \arg\min_{n, \theta} \left[-\log f_Y(y; \theta) + \frac{n}{2}\log N\right]$  (2.73)

where N is the length of the observation sequence.
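
A small worked instance may help to see how the two terms of (2.73) balance. The sketch assumes a hypothetical nested model that is not taken from the text: each observation y_i has unit variance, the first n observations have free means theta_1, ..., theta_n, and the remaining ones have zero mean. Under this assumption the ML fit sets theta_i = y_i, so the negative log-likelihood is, up to a constant, half the energy of the observations left unmodelled. The function name `mdl_order` is likewise an assumption.

```python
import math

def mdl_order(y):
    """Return the order n that minimizes -log f_Y + (n / 2) log N, and all the costs."""
    n_obs = len(y)
    costs = []
    for n in range(n_obs + 1):
        # ML fit: theta_i = y_i for i < n, so only the tail contributes to -log f_Y.
        neg_log_lik = 0.5 * sum(v * v for v in y[n:])
        costs.append(neg_log_lik + 0.5 * n * math.log(n_obs))
    best = min(range(n_obs + 1), key=costs.__getitem__)
    return best, costs

# Three large leading values followed by small ones.
y = [5.0, -4.0, 3.0, 0.1, -0.2, 0.05, 0.1, -0.1]
order, costs = mdl_order(y)
print(order)
```

Adding a parameter is accepted only while the reduction in the negative log-likelihood exceeds the description-length penalty of (1/2) log N per parameter.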

For  all  these  cases,  where  the  conditional  information  of  the  data  with  respect  to  the 
model  is  the  probabilistic  (Shannon)  information,  the  MI  criterion  has  the  form  of  (2.71), 
and  the  EM  algorithm  of  (2.67) ,(2.68)  becomes, 

•  Start,  n  =  0  ,  initial  model  jt(0*  ' 

•  Iterate  (until  some  convergence  criterion  is  met) 

-  The  E  step:  calculate 

Q(x;  »<">)  =  E  {log  /*(*;  x)  1  jr,  »<">  }  (2.74) 

Note  that  this  step  corresponds  to  the  E  step  of  the  regular  EM  algorithm,  (2.17). 

-  The  M  step:  minimize 

$\pi^{(n+1)} = \arg\min_{\pi} \left[ -Q(\pi;\pi^{(n)}) + I(\pi) \right]$  (2.75)

- n = n + 1

This algorithm was suggested in [2] for the MAP criterion.

It is easy to show that each iteration improves (decreases) the likelihood-like goal function
of the observation (2.71). The goal function may be written as

$G(\pi) = -L(\pi) + I(\pi) = -[Q(\pi;\pi') - H(\pi;\pi')] + I(\pi)$  (2.76)

where $Q(\cdot;\cdot)$ and $H(\cdot;\cdot)$ are defined in (2.11). Thus by a simple extension of theorem 2.1,
we conclude that

$G(\pi^{(n+1)}) \le G(\pi^{(n)})$  (2.77)

where equality holds if and only if both

$-Q(\pi^{(n+1)};\pi^{(n)}) + I(\pi^{(n+1)}) = -Q(\pi^{(n)};\pi^{(n)}) + I(\pi^{(n)})$  (2.78)

and 

$f_{X/Y}(\underline{x}/\underline{y};\pi^{(n+1)}) = f_{X/Y}(\underline{x}/\underline{y};\pi^{(n)})$  a.e. in $\mathcal{X}(\underline{y})$  (2.79)

Due to the monotonicity property, if the goal function is bounded from below, the sequence
$\{G^{(n)}\}$, where $G^{(n)} = G(\pi^{(n)})$, must converge to a limit $G^*$. Also, a global optimizer of the goal
function must be a fixed point of the algorithm.
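
The step from (2.76) to (2.77) can be spelled out in one line. The following is a sketch that assumes, as in theorem 2.1, that $H(\pi;\pi') \le H(\pi';\pi')$ for every $\pi$ (Jensen's inequality), and that the M step (2.75) attains the minimum over $\pi$:

```latex
\begin{aligned}
G(\pi^{(n+1)}) - G(\pi^{(n)})
 &= \underbrace{\left[-Q(\pi^{(n+1)};\pi^{(n)}) + I(\pi^{(n+1)})\right]
    - \left[-Q(\pi^{(n)};\pi^{(n)}) + I(\pi^{(n)})\right]}_{\le\, 0 \ \text{by the M step (2.75)}} \\
 &\quad + \underbrace{H(\pi^{(n+1)};\pi^{(n)}) - H(\pi^{(n)};\pi^{(n)})}_{\le\, 0 \ \text{by Jensen's inequality}}
 \;\le\; 0 .
\end{aligned}
```

Equality requires both underbraced terms to vanish, which is the pair of conditions (2.78) and (2.79).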

For proving other convergence properties, such as convergence to a stationary point,
convergence of the model estimate sequence and the rate of convergence, we need the
set of models $\Pi$ to be a subset of a metric space. Otherwise, the notions of distance,
convergence, continuity etc. are undefined. When $\Pi$ is a subset of a metric space, those
convergence properties are readily available. We will use the results developed in section 2.2,
where $Q(\underline{\theta};\underline{\theta}')$ is replaced by $-Q(\pi;\pi') + I(\pi)$. Thus, convergence to a stationary point
is guaranteed if

$-Q(\pi;\pi') + I(\pi)$ is continuous in both $\pi$ and $\pi'$  (2.80)

The rate of convergence near the convergence point is the largest eigenvalue of

$-D^{20}H(\pi^*;\pi^*) \left[ -D^{20}Q(\pi^*;\pi^*) + D^{2}I(\pi^*) \right]^{-1}$  (2.81)

where $D^{20}Q(\underline{\theta}^*;\underline{\theta}^*)$ in equation (2.44) is replaced by $-D^{20}Q(\pi^*;\pi^*) + D^{2}I(\pi^*)$.

2.6  Possible  signal  processing  applications 

The EM algorithm and its extensions developed in this chapter will be applied later in
this thesis to solve a variety of signal processing problems. We will conclude this chapter
by trying to characterize the problems that are naturally solved using the EM method.

In many cases, we will describe the possible applications of the EM algorithm as having
noisy and incomplete observations. As an example, we became interested in the EM algorithm
while considering the problem of power spectrum estimation from a short record of
observations. The modern spectral estimation techniques, e.g. Burg's Maximum Entropy
technique [32], achieve high resolution by artificially extending the observation period or
the autocorrelation support. This solution requires exact knowledge of the autocorrelation
values. However, the sample autocorrelation values are only noisy estimates of the
real correlation values. In our opinion, a better approach for high resolution spectrum estimation
is to consider the short observation record as noisy and incomplete and to model
the spectrum estimation problem as a statistical ML problem. Following these considerations,
we have presented in [33] a parametric spectrum estimation method based on the
EM algorithm. This method, suggested originally in [33], was later investigated by various
authors [34], [35].

In  general  we  can  define  two  classes  of  possible  signal  processing  applications.  The 
first  class  contains  signal  processing  problems  having  partial  or  distorted  observations.  The 
problems  of  this  class  are  characterized  as  follows:  We  may  be  interested  in  estimating 
unknown  parameters  or  even  reconstructing  a  whole  waveform.  For  this  task,  it  is  desired 
to  measure  some  signals.  However,  we  observe  only  a  mapping  of  the  desired  signals,  such 
as, 

•  the  magnitude  of  the  signal(s)  or 

•  the  sign  or  the  hard  limited  version  of  the  signal(s)  or 

•  the  quantized  signal(s)  or

•  the  aliased  signal(s)


or  any  other  partial  information.  Since  the  observations  are  distorted  and  incomplete  the 
statistical  problem  associated  with  the  signal  processing  problem  is  complicated.  The  EM 
algorithm  provides  a  natural  solution  to  these  problems,  where  the  complete  data  is  defined 
as  the  undistorted  signals.  The  algorithm  iterates  between  estimating  these  undistorted 
signals  and  updating  the  desired  parameters. 

The second class of applications contains signal processing problems for which the
observations are described as a combination of simpler signals. We are interested in estimating
signal  parameters  or  reconstructing  a  signal  waveform;  however,  instead  of  observing  the 
desired  signals,  we  observe  a  combination  such  as, 

•  sum  of  signals  or 

•  multiplication  of  signals  or 

•  convolution  of  signals 

or  any  other  combination.  We  use  a  probabilistic  modeling  of  the  various  signals.  With 
the  observations  above,  the  signal  processing  problem  is  modeled  as  doubly  (or  multiply) 
stochastic  phenomena.  The  statistical  problems  generated  by  doubly  stochastic  models  are 
usually  complicated.  The  EM  algorithm  provides  a  natural  solution  to  these  problems, 
where  the  complete  data  is  the  set  consisting  of  all  the  separate  signal  components.  In  the 
doubly  stochastic  case  this  complete  data  is  equivalent  to  the  set  consisting  of  one  of  the 
“hidden” signal components and the combined observations.

Many  applications  that  belong  to  this  class  may  be  considered,  since  combined  signals 
are common in many practical situations. In chapters 4 and 5, we will consider examples
that belong to this class of applications and solve them via the EM algorithm.

We  conclude  this  chapter  by  briefly  presenting  two  previously  suggested  applications, 
which fall naturally within the EM framework. The first application is the problem of
estimating  the  parameters  of  a  stationary  signal  in  a  stationary  noise.  The  EM  solution 
to  this  problem  is  referred  to  as  the  iterative  Wiener  filter.  The  second  application  is  the 
estimation  of  the  parameters  of  a  Hidden  Markov  Model  (HMM).  The  EM  solution  to  this 
problem  is  widely  known  as  Baum's  algorithm.  We  point  out  that  both  these  previous 
applications  belong  to  the  class  presented  above  and  are  thus  analogous.  We  believe  that 
an  additional  insight  to  the  HMM  analysis  can  be  gained  by  presenting  it  in  terms  of  an 
iterative  filtering  technique. 

The  iterative  Wiener  filter 

Let $s(t;\underline{\theta})$ be a stationary (discrete) random process, and suppose that this process is
observed in additive noise, i.e. we observe

$y(t) = s(t;\underline{\theta}) + n(t)$  (2.82)

where $n(t)$ is also a stationary process. We are interested in estimating the signal parameters
and, in some cases, in filtering the signal.

This model was suggested in [3] to represent a speech enhancement problem arising
from single microphone measurements. In this case, the speech signal is modeled as a
stationary autoregressive (AR) process with unknown coefficients, referred to in the speech
context as Linear Prediction Coefficients (LPC). Speech signals are frequently modeled as
AR processes, since this model captures the important features of the speech signal, at least
for a short enough observation window. We may be interested in finding the speech LPC
parameters, say for vocoding or for use in a speech recognition system, or in enhancing the
speech signal. If the speech signal is observed without the noise, the LPC parameters are
easily estimated by solving the appropriate normal linear equations. However, estimating
the parameters from noisy observations is complicated.

The iterative Wiener filter suggested in [3] can be interpreted as an EM algorithm as
follows. Let the complete data be defined as the speech signal, $s(t;\underline{\theta})$, and the noise signal,
$n(t)$, separately. An equivalent definition is to let the complete data be the speech signal,
$s(t;\underline{\theta})$, in addition to the observed signal, $y(t)$. From the discussion above, if we observe the
signal without the noise, the task of estimating the parameters will be easy. In addition,
having the signal parameters, we may estimate the signal, i.e. filter it from the noise, using
a Wiener filter. This suggests an EM algorithm that iterates between Wiener filtering,
applied to the observed signal, using the current spectral (or LPC) parameters of the signal
(the E step) and updating the spectral parameters using the filtered signal (the M step).

We note that the filtered speech signal, $\hat{s}(t)$, is achieved as a by-product, while implementing
the E step of the algorithm.
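
The alternation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure of [3]: the noise is assumed white with known variance `noise_var`, the AR order is assumed known, the Wiener filter is the non-causal one applied through the FFT (so the filtering is circular), and the parameter update uses only the filtered signal, as in the description above. A full EM M step would also account for the conditional covariance of the signal. The names `ar_from_autocorr` and `iterative_wiener` are hypothetical.

```python
import numpy as np

def ar_from_autocorr(r, order):
    """Solve the normal (Yule-Walker) equations for the AR coefficients."""
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    drive_var = r[0] - a @ r[1:order + 1]        # variance of the driving noise
    return a, drive_var

def iterative_wiener(y, noise_var, order=2, iterations=3):
    y = np.asarray(y, dtype=float)
    n = len(y)
    s_hat = y.copy()                             # start from the noisy observation
    z = np.exp(-2j * np.pi * np.fft.fftfreq(n))  # e^{-j w} on the FFT grid
    for _ in range(iterations):
        # Parameter update: AR fit to the current filtered signal
        r = np.array([s_hat[:n - k] @ s_hat[k:] for k in range(order + 1)]) / n
        a, drive_var = ar_from_autocorr(r, order)
        # Signal estimate: Wiener filter built from the current AR spectrum
        denom = 1 - sum(a[k] * z ** (k + 1) for k in range(order))
        signal_psd = drive_var / np.abs(denom) ** 2
        gain = signal_psd / (signal_psd + noise_var)
        s_hat = np.real(np.fft.ifft(gain * np.fft.fft(y)))
    return s_hat, a

# Synthetic AR(2) signal in white noise
rng = np.random.default_rng(1)
n = 4096
e = rng.standard_normal(n)
s = np.zeros(n)
for t in range(2, n):
    s[t] = 1.5 * s[t - 1] - 0.8 * s[t - 2] + e[t]
y = s + rng.standard_normal(n)
s_hat, a_hat = iterative_wiener(y, noise_var=1.0)
```

Only a few iterations are used here; how many are appropriate in practice is a design choice that this sketch does not address.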

Hidden  Markov  Models  and  Baum's  algorithm 

Hidden Markov Models (HMM) are interesting and rich statistical models that have been
used frequently to model complex real problems. The iterative Baum's algorithm, suggested
in [36], which is now recognised as an instance of the EM algorithm, was suggested for the
statistical analysis of HMM. These models are extensively used for modeling the mechanism
that generates the speech signal, and are applied in speech recognition systems. A review of
HMM may be found in [37], and a review of their application to automatic speech recognition
may be found in [38]. We will now briefly present the Hidden Markov Models and Baum's
algorithm, from a different perspective.

Suppose we observe the sequence $y_0, y_1, \ldots, y_{N-1}$. A hidden Markov model assumes
that the observations are probabilistic functions of a finite alphabet Markov chain. In other
words, there exists a hidden Markov sequence, $s_0, s_1, \ldots, s_{N-1}$, such that the observed
sequence is a result of combining the Markov sequence with another stochastic contribution,
i.e. the model assumes that for each $i$, there is a conditional probability distribution

$P(y_i / s_i = \alpha_k)$,  $k = 1, \ldots, M$  (2.83)

where $\{\alpha_1, \ldots, \alpha_M\}$ is the alphabet of the Markov chain.

Using this point of view, the observations, $\{y_i\}$, in a hidden Markov model are the result
of a “signal”, the Markov chain $\{s_i\}$, combined with “noise”.

The unknown “signal” parameters in this case are the transition probabilities, represented
by the transition matrix, and, in a non-stationary case, the initial probability vector.
Sometimes, the “noise” parameters, i.e. the parameters that define the conditional probabilities
of (2.83), are also unavailable. The ML problem for estimating these parameters is
usually too complicated to solve directly.

The complete data will be the hidden “signal” in addition to the observations, which
is equivalent to observing the “signal” and the “noise” separately. This complete data is
analogous to the complete data used in the iterative Wiener algorithm. Suppose we observe
the hidden Markov chain. If the unknown parameters are all entries of the transition
matrix, the ML estimate of, say, $\theta_{l,k}$ is achieved by counting the number of transitions from
the symbol $\alpha_l$ to the symbol $\alpha_k$, divided by the number of occurrences of the symbol $\alpha_l$.
The “noise” parameters may also be estimated easily, given the “noise” realization, which is
determined when the observations and the underlying hidden Markov chain are available.
The specific procedure is defined depending on the specific parameterization.
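
The counting rule for the transition probabilities, when the Markov chain itself is observed, can be sketched as follows. The function name `ml_transition_matrix` and the integer coding of the alphabet (symbols `0, ..., M-1`) are assumptions of this illustration; the number of occurrences of a symbol is counted over all positions that have a successor.

```python
import numpy as np

def ml_transition_matrix(states, num_symbols):
    """ML estimate of the transition matrix from a fully observed chain."""
    counts = np.zeros((num_symbols, num_symbols))
    for prev, nxt in zip(states[:-1], states[1:]):
        counts[prev, nxt] += 1                      # one transition prev -> nxt
    row_totals = counts.sum(axis=1, keepdims=True)  # occurrences of each symbol
    # Rows of symbols that never occur (with a successor) are left as zeros
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

theta_hat = ml_transition_matrix([0, 0, 1, 0, 1, 1], num_symbols=2)
```

In this short chain the symbol 0 is followed once by 0 and twice by 1, and the symbol 1 is followed once by each symbol, which gives the rows (1/3, 2/3) and (1/2, 1/2).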

In the iterative Wiener algorithm, where a stationary signal in stationary noise was
considered, the E step was implemented in the frequency domain using a Wiener
filter. In the HMM case, where the signal is a Markov chain, Bellman's sequential dynamic
programming algorithm [39] can be used. Bellman's algorithm is sometimes referred to as
the Viterbi algorithm. The E step, which estimates the required statistics of the hidden
Markov chain, needed for updating the parameters, is thus implemented by an efficient
sequential algorithm.
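
In the HMM literature, Baum's algorithm obtains the E-step statistics with the forward-backward recursions, a sequential pass over the chain in each direction. The following Python sketch shows these recursions for discrete observations; it is an illustration rather than the formulation of [37], and the names `forward_backward`, `trans`, `emit` and `init` are hypothetical. It returns the posterior state probabilities `gamma` and the posterior transition probabilities `xi`, which play the role of the "soft" counts replacing the hard counts available when the chain is observed.

```python
import numpy as np

def forward_backward(obs, trans, emit, init):
    """Posterior statistics of a hidden Markov chain given discrete observations.

    trans[i, j] = P(s_{t+1} = j / s_t = i), emit[i, o] = P(y_t = o / s_t = i),
    init[i] = P(s_0 = i).
    """
    obs = np.asarray(obs)
    n, m = len(obs), len(init)
    alpha = np.zeros((n, m))
    beta = np.ones((n, m))
    scale = np.zeros(n)
    # Forward pass, normalized at each step to avoid underflow
    alpha[0] = init * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass, using the same normalization constants
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta                       # gamma[t, i] = P(s_t = i / all y)
    weighted = emit[:, obs[1:]].T * beta[1:]   # shape (n - 1, m)
    # xi[t, i, j] = P(s_t = i, s_{t+1} = j / all y)
    xi = (alpha[:-1, :, None] * trans[None, :, :] * weighted[:, None, :]
          / scale[1:, None, None])
    return gamma, xi

trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit = np.array([[0.7, 0.3], [0.1, 0.9]])
init = np.array([0.5, 0.5])
gamma, xi = forward_backward([0, 0, 1, 1, 0, 1], trans, emit, init)
# Re-estimated transition matrix (the M step for the "signal" parameters)
trans_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
```

The last line is the counting rule with expected counts in place of observed ones.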

The  detailed  Baum's  algorithm  is  presented  explicitly  in  [37],  page  11  and  elsewhere. 
The  interpretation  of  this  algorithm  as  an  iterative  filtering  algorithm  gives  an  additional 
insight,  that  may  help  suggest  enhancements  to  this  algorithm. 


Chapter  3 


Sequential  and  Adaptive 
algorithms 


In this chapter, we will suggest and investigate sequential and adaptive algorithms
that are based on the EM concept. Sequential and adaptive algorithms correspond to the
case where the data is processed sequentially and an output is expected whenever a new
block of data is processed. We denote the $(n+1)$st data block by $\underline{y}_{n+1}$, and suppose that
the desired output is an estimate of a parameter vector, $\underline{\theta}$. The general structure of any
sequential (or adaptive) algorithm is,

(3.1) 

The desired output of a sequential algorithm is either identical to, or at least asymptotically
identical to, the result achieved by processing the whole data at once. The advantage of
the sequential algorithm over the batch algorithm is not in the final result, but in computational
and storage efficiency and in the fact that an output may be provided without having
to wait for all data to be processed. Adaptive algorithms correspond to the case where the
underlying system features are time varying, and the algorithm is expected to track the
varying parameters. In this case, processing all the available data jointly is not desired,
even if we can accommodate the computational and storage load of the batch algorithm
and can afford to wait until the end of the data, since different data segments correspond to
different parameter values.

Sequential and adaptive algorithms may be suggested based on an iterative algorithm
in the following way. Given the current estimate, $\underline{\theta}^{(n)}$, the next iteration takes into account
the new data block, $\underline{y}_{n+1}$, for generating the updated value, $\underline{\theta}^{(n+1)}$. A well known
example is the stochastic gradient algorithm, which is an adaptive version of the iterative
gradient algorithm. As another example, the recursive least-squares (RLS) algorithm
and the (extended) Kalman algorithm are sequential algorithms based on the iterative
Newton-Raphson method. Similarly, the iterative EM algorithm may suggest sequential
and adaptive algorithms. These algorithms will be developed in this chapter.

The chapter is organized as follows. In section 3.1 we will develop sequential algorithms
based on the EM method, that may be applied only when the underlying estimation problem
has a special structure. In section 3.2 we will use approximations and develop sequential
and adaptive algorithms, based on the EM method, that may be applied in general. The
sequential algorithms presented in this chapter will be analyzed in section 3.3.

3.1  Sequential  EM  algorithms  based  on  problem  structure

In this section, we will identify the cases where the underlying estimation problem has a
special structure, and suggest sequential EM algorithms that exploit this special structure.
The next section will demonstrate that, in general, a sequential algorithm cannot be derived
as a direct consequence of the EM method. Only using approximations will we be able to
suggest sequential and adaptive algorithms for the general case.

3.1.1  Sequential  EM  with  exact  EM  mapping 

Throughout this chapter, we will consider the observed data as blocks, $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_n, \ldots$,
to be processed sequentially. The complete data is denoted $\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n, \ldots$, and is chosen
so that each block of observed data corresponds to a block of complete data, $\underline{x}_n$, by

$\underline{y}_n = T_n(\underline{x}_n)$  (3.2)

where $T_n(\cdot)$ is a non-invertible transformation.

In this environment the log-likelihood of the observations, after n + 1 data blocks have
been observed, is given by,

$L_{n+1}(\underline{\theta}) = \log f_{Y_{n+1},\ldots,Y_1}(\underline{y}_{n+1},\ldots,\underline{y}_1;\underline{\theta})$  (3.3)

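Using the complete data, $\underline{x}_1, \ldots, \underline{x}_{n+1}$, and following the identity (2.12), the log-likelihood
of the observations may be written as,

$L_{n+1}(\underline{\theta}) = Q_{n+1}(\underline{\theta};\underline{\theta}') - H_{n+1}(\underline{\theta};\underline{\theta}')$  (3.4)

where

$Q_{n+1}(\underline{\theta};\underline{\theta}') = E\{\log f_{X_{n+1},\ldots,X_1}(\underline{x}_{n+1},\ldots,\underline{x}_1;\underline{\theta}) \mid \underline{y}_{n+1},\ldots,\underline{y}_1;\underline{\theta}'\}$  (3.5)

$H_{n+1}(\underline{\theta};\underline{\theta}') = E\{\log f_{X_{n+1},\ldots,X_1/Y_{n+1},\ldots,Y_1}(\underline{x}_{n+1},\ldots,\underline{x}_1/\underline{y}_{n+1},\ldots,\underline{y}_1;\underline{\theta}) \mid \underline{y}_{n+1},\ldots,\underline{y}_1;\underline{\theta}'\}$  (3.6)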

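An EM algorithm for solving the maximum likelihood problem, given these n + 1 blocks
of data, using the above definition of complete data, is given by the following iteration,

$\underline{\theta}^{(k+1)} = \arg\max_{\underline{\theta}} Q_{n+1}(\underline{\theta};\underline{\theta}^{(k)}) = \arg\max_{\underline{\theta}} E\{\log f_{X_{n+1},\ldots,X_1}(\underline{x}_{n+1},\ldots,\underline{x}_1;\underline{\theta}) \mid \underline{y}_{n+1},\ldots,\underline{y}_1;\underline{\theta}^{(k)}\}$  (3.7)

where $k$ denotes the iteration index and $n$ the data index.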

A sequential EM algorithm with exact EM mapping is a method that recalculates in each
iteration, as more data is processed, the exact steps of the EM algorithm for maximizing
the new likelihood function. For convenience, suppose we perform a single EM iteration for
each new observed data block, i.e. the iteration and the data indices are equivalent. This
mapping is given by (3.7) where $k$ is replaced by $n$. This EM mapping is, in general, a
function of all given observed data blocks; thus, it may be written abstractly as,

$\underline{\theta}^{(n+1)} = F(\underline{\theta}^{(n)}, \underline{y}_{n+1}, \ldots, \underline{y}_1)$  (3.8)

The exact EM iteration may be implemented recursively, when the effect of the past
data blocks, $\underline{y}_n, \ldots, \underline{y}_1$, can be summarized into a small number of simple quantities. We
may algebraically manipulate the given expression for the EM iteration and achieve an
equivalent expression, that may be written abstractly as the mapping,

$\underline{\theta}^{(n+1)} = \tilde{F}(\underline{\theta}^{(n)}, \underline{y}_{n+1}, g(\underline{y}_n, \ldots, \underline{y}_1))$  (3.9)

where $g$ indicates easily stored and updated functions of the past observations.

We  will  assume  that  the  structure  of  (3.9)  may  be  achieved  for  all  n.  In  this  case,  we 
suggest  the  following  sequential  EM  algorithm: 

• Start, n = 0: Guess $\underline{\theta}^{(0)}$. Initialize $g(\cdot,\ldots,\cdot) = 0$

• For each new data block, $\underline{y}_{n+1}$,

- Exact EM mapping: Update parameters,

(3.10)

- Update and record $g(\underline{y}_{n+1}, \ldots, \underline{y}_1)$ for the next step

- n = n + 1


In each step, this algorithm implements the exact EM mapping for maximizing the new
likelihood $L_{n+1}(\underline{\theta})$, and thus, $L_{n+1}(\underline{\theta}^{(n+1)}) \ge L_{n+1}(\underline{\theta}^{(n)})$.

This  algorithm  has  been  presented  abstractly  so  far.  To  fix  ideas,  we  now  present  a 
simple  example,  in  which  a  linear  least  squares  problem  is  solved  recursively  using  this 
algorithm. 

Example:  Sequential  Least  Squares  EM  algorithm 

It  is  well  known  that  the  linear  least-squares  problem  may  be  posed  as  a  statistical  maximum 
likelihood problem, in the following way. Suppose we observe a vector, $\underline{y} = (y_1, \ldots, y_n)^T$,
given by,

$\underline{y} = A\underline{\theta} + \underline{n}$  (3.11)

where $\underline{\theta} = (\theta_1, \ldots, \theta_k)^T$ is the unknown parameter vector, $\underline{n} = (n_1, \ldots, n_n)^T$ is the noise
vector, where $\{n_i\}$ are i.i.d. random variables distributed normally with zero mean and
variance $\sigma^2$, and $A$ is a given $(n \times k)$ matrix, which may be written by columns as $A =$
4i ,  •  •  • ,  a*|  or  by  rows  as  AT  -  ,  •  •  • ,  .  In  this  case  maximizing  the  likelihood  of  the 

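observations yields a least-squares problem,

$\hat{\underline{\theta}}_{ML} = \arg\max_{\underline{\theta}} \log f_Y(\underline{y};\underline{\theta}) = \arg\min_{\underline{\theta}} \frac{1}{2\sigma^2} \|\underline{y} - A\underline{\theta}\|^2$  (3.12)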

We start to develop an EM algorithm for this problem by choosing the complete data.
Suppose that the vectors $\{\underline{x}_j\}_{j=1}^{k}$ are defined as,

$\underline{x}_j = \underline{a}_j \theta_j + \underline{n}_j$  (3.13)

where $\underline{n}_j$ is an $(n \times 1)$ noise vector, whose components are zero mean Gaussian i.i.d. random
variables with variance $\beta_j \sigma^2$. Assuming that $\{\underline{n}_j\}$ are uncorrelated and that $\sum_{j=1}^{k} \beta_j = 1$,
we have

$\underline{y} = \sum_{j=1}^{k} \underline{x}_j$  (3.14)

The complete data is defined as the set $\{\underline{x}_j\}_{j=1}^{k}$. Writing the complete data as a long vector,
$\underline{x}^T = (\underline{x}_1^T, \ldots, \underline{x}_k^T)$, the relation between complete and incomplete data is given by $\underline{y} = H \underline{x}$,
where $H$ is the non-invertible $(n \times n k)$ matrix

$H = [\, I \mid I \mid \cdots \mid I \,]$  ($k$ times)

This is a (simple) linear-Gaussian situation; using the results of section 2.3, the E
and M steps of an EM algorithm for solving the least-squares problem of (3.12) are given
by,

(3.18)


1  at  1  r-' 

£«=  ~  -4  U  =  -Law 

«=l 

Given a new measurement, $y_{n+1}$, we can update these two quantities recursively, as,

t  _  n  .  1  T 

n~1  ~  ~  ***  ^TT 

2n-l  =  (3.19) 


The exact EM iteration (3.17) may be written as,


-  *•  ( T^h-y ■  •  ■  •  dbf) '  (*•-■  -  ■ ■ 8,nl>  (3  ■*» 

which  can  be  calculated  recursively,  since  all  required  quantities  are  calculated  recursively. 
The  sequential  least  squares  EM  algorithm  (SLSEM)  is  completely  specified  by  (3.19)  and 
(3.20). 
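
A sequential least-squares recursion of this kind can be sketched in Python. The sketch is derived here from the complete data definition (3.13), (3.14) rather than copied from (3.19), (3.20), so its exact form is an assumption: the E step is taken as $\hat{\underline{x}}_j = \underline{a}_j\theta_j + \beta_j(\underline{y} - A\underline{\theta})$, the M step as $\theta_j = \underline{a}_j^T\hat{\underline{x}}_j / \underline{a}_j^T\underline{a}_j$, and the data enter only through the running averages `R` (of the outer products of the rows of $A$) and `p` (of the rows times the measurements). The names `slsem`, `R`, `p` and `beta` are hypothetical.

```python
import numpy as np

def slsem(rows, y, beta=None):
    """One EM iteration per new measurement for the linear least-squares problem.

    rows[i] is the i-th row of A, y[i] the i-th measurement; beta holds the
    noise-splitting weights, assumed positive and summing to one.
    """
    k = rows.shape[1]
    beta = np.full(k, 1.0 / k) if beta is None else np.asarray(beta, dtype=float)
    theta = np.zeros(k)
    R = np.zeros((k, k))        # running average of a a^T
    p = np.zeros(k)             # running average of a * y
    for n, (a, yn) in enumerate(zip(rows, y)):
        # Recursive update of the stored statistics with the new measurement
        R = (n * R + np.outer(a, a)) / (n + 1)
        p = (n * p + a * yn) / (n + 1)
        # Single EM iteration for the least-squares problem of the data so far
        theta = theta + beta / np.diag(R) * (p - R @ theta)
    return theta

rng = np.random.default_rng(2)
theta_true = np.array([1.0, -2.0, 0.5])
rows = rng.standard_normal((500, 3))
y = rows @ theta_true           # noise-free data, so the exact solution is theta_true
theta_hat = slsem(rows, y)
```

Each step costs on the order of $k^2$ operations, dominated by the update of `R` and the product `R @ theta`, and no matrix inverse is propagated.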


As  a  final  comment,  we  may  compare  this  SLSEM  algorithm  to  the  well  known  LMS 
and  RLS  algorithms,  which  also  solve  this  linear  least  squares  problem  recursively.  The 
LMS  algorithm  is  specified  by  the  following  recursion, 


£(-»  =  *(«>  _  <jan+1  -  aLi2ln)) 


(321) 


where $\theta_k$ is the $k$th component of the vector $\underline{\theta}$.

The RLS algorithm, which exactly solves in each step the $n$th order least-squares problem,
is specified by,

(v«*i  -  aL.2(n))  (3.22) 

where  .4"i,  may  be  calculated  recursively  as, 


j-l  _  a-l  * 

•"n-rl  -  .  T 

1  -  fitn*l 


(3.23) 
We notice immediately that the complexity of the LMS algorithm in each iteration is
linear in the number of unknowns $k$, while the complexity of the SLSEM algorithm and the
RLS algorithm is quadratic, of order $k^2$. The SLSEM algorithm requires less computation,
however.

A  few  experiments  with  these  algorithms  indicate  that  the  convergence  of  the  SLSEM 
is  faster  than  that  of  the  LMS  algorithm.  The  convergence  of  the  SLSEM  algorithm  is, 
of  course,  slower  than  that  of  the  RLS  algorithm  since  the  RLS  algorithm  exactly  solves 
the  least-squares  problem  in  each  step.  However,  the  convergence  rates  of  the  RLS  and 
SLSEM  algorithms  to  the  true  value  of  the  parameters,  as  a  function  of  the  data  index,  are 
comparable. 
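
For reference, the two classical recursions used in this comparison can be sketched in their textbook forms. This is an illustration only: the step size `step`, the initial value `delta` of the inverse matrix and the function names are assumptions of the sketch, and the notation may differ from that of (3.21) to (3.23).

```python
import numpy as np

def lms(rows, y, step):
    """LMS: a stochastic gradient step per measurement, O(k) operations."""
    theta = np.zeros(rows.shape[1])
    for a, yn in zip(rows, y):
        theta = theta + step * a * (yn - a @ theta)
    return theta

def rls(rows, y, delta=1e6):
    """RLS: exact least-squares solution after each measurement, O(k^2) operations."""
    k = rows.shape[1]
    theta = np.zeros(k)
    P = delta * np.eye(k)       # inverse of the accumulated sum of a a^T
    for a, yn in zip(rows, y):
        Pa = P @ a
        P = P - np.outer(Pa, Pa) / (1.0 + a @ Pa)   # matrix inversion lemma
        theta = theta + (P @ a) * (yn - a @ theta)
    return theta

rng = np.random.default_rng(3)
theta_true = np.array([1.0, -2.0, 0.5])
rows = rng.standard_normal((500, 3))
y = rows @ theta_true           # noise-free data
theta_lms = lms(rows, y, step=0.05)
theta_rls = rls(rows, y)
```

The large initial `delta` acts as a very weak regularization, so the RLS estimate matches the least-squares solution up to a negligible bias.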

3.1.2  Sequential  EM  algorithm  based  on  recursive  E  and  M  steps 

The sequential EM algorithm presented now is applicable to the following situation.
Suppose that, given the complete data, there is a sequential (or adaptive) algorithm for
estimating the parameters. Also, suppose that the required statistics of the complete data
may be estimated recursively from the observations. In this case, as each new block is
observed, the necessary new complete data statistics are estimated recursively, given the
current value of the parameters - the E step. Then, the parameter values are updated
sequentially, given the new estimated block of complete data statistics - the M step.

To be more specific, suppose that, if the complete data, $\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n, \ldots$, is observed,
there is a sequential (or adaptive) algorithm for estimating the parameters, i.e.

$\underline{\theta}^{(n+1)} = G(t(\underline{x}_{n+1}), r(\underline{x}_n, \ldots, \underline{x}_1), \underline{\theta}^{(n)})$  (3.24)

where $t(\underline{x}_{n+1})$ are the statistics to be extracted from the new data $\underline{x}_{n+1}$. The statistics
$r(\underline{x}_n, \ldots, \underline{x}_1)$ are extracted from the past data. Since (3.24) represents a sequential procedure,
these statistics are easily stored and updated.

Unfortunately, the complete data is not available. We only observe the incomplete data,
$\underline{y}_1, \ldots, \underline{y}_n, \ldots$, so we cannot estimate the parameters via (3.24). Following the EM idea,
we should estimate the complete data statistics, needed for the sequential algorithm of
(3.24), by taking their conditional expectation, given $\underline{y}_{n+1}, \ldots, \underline{y}_1$ and the current value,
$\underline{\theta}^{(n)}$, of the parameters. These conditional expectations may be calculated sequentially too.
Consider, for example, the case where the complete data is composed of two Gaussian
signals, say $s(n)$ and $w(n)$, and the observed data is the sum of these signals, i.e. $y(n) =
s(n) + w(n)$. The conditional expectation of the complete data, given the observation, in a
non-sequential EM algorithm requires a Wiener filter. However, this conditional expectation
may be achieved sequentially, using a causal Wiener filter or a Kalman filter.

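A sequential procedure for estimating the required complete data statistics may be
described abstractly, e.g.,

$E\{t(\underline{x}_{n+1}) \mid \underline{y}_{n+1}, \ldots, \underline{y}_1; \underline{\theta}^{(n)}\} = F(\underline{y}_{n+1}, g(\underline{y}_n, \ldots, \underline{y}_1), \underline{\theta}^{(n)})$  (3.25)

where the functions $g$, summarizing the contribution of past observations, are easily stored
and calculated.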

A  sequential  (adaptive)  EM  algorithm,  based  on  recursive  E  and  M  steps  is  presented 
formally  as  follows. 

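• Start, n = 0: Guess $\underline{\theta}^{(0)}$. Initialize $g_t(\cdot,\ldots,\cdot) = 0$, $g_r(\cdot,\ldots,\cdot) = 0$

• For each new data block, $\underline{y}_{n+1}$,

- Sequential E-step: calculate

$\hat{t} = E\{t(\underline{x}_{n+1}) \mid \underline{y}_{n+1}, \ldots, \underline{y}_1; \underline{\theta}^{(n)}\} = F_t(\underline{y}_{n+1}, g_t(\underline{y}_n, \ldots, \underline{y}_1), \underline{\theta}^{(n)})$

$\hat{r} = E\{r(\underline{x}_n, \ldots, \underline{x}_1) \mid \underline{y}_n, \ldots, \underline{y}_1; \underline{\theta}^{(n)}\} = F_r(g_r(\underline{y}_n, \ldots, \underline{y}_1), \underline{\theta}^{(n)})$  (3.26)

- Sequential M-step: Update parameters

$\underline{\theta}^{(n+1)} = G(\hat{t}, \hat{r}, \underline{\theta}^{(n)})$  (3.27)

- Update and record $g_t(\underline{y}_{n+1}, \ldots, \underline{y}_1)$, $g_r(\underline{y}_{n+1}, \ldots, \underline{y}_1)$ for the next step

- n = n + 1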

Suppose that the recursive procedure (3.24), suggested when the complete data is given,
increases the likelihood of the complete data, $L^{c}_{n+1}(\underline{\theta})$, i.e. it satisfies,

$L^{c}_{n+1}(\underline{\theta}^{(n+1)}) \ge L^{c}_{n+1}(\underline{\theta}^{(n)})$  (3.28)

In this case, it is easy to show that the recursive algorithm suggested by (3.26) and (3.27)
increases the current likelihood of the observations, as follows. The
function $Q_{n+1}(\underline{\theta};\underline{\theta}^{(n)})$ has the same functional form as $L^{c}_{n+1}(\underline{\theta})$, where the estimated statistics,
$\hat{t}$ and $\hat{r}$, are substituted in place of the statistics, $t$ and $r$. Thus, if (3.28) is true, then
(3.27) implies

$Q_{n+1}(\underline{\theta}^{(n+1)};\underline{\theta}^{(n)}) \ge Q_{n+1}(\underline{\theta}^{(n)};\underline{\theta}^{(n)})$  (3.29)

Using (3.4), (3.29) and Jensen's inequality we get,

$L_{n+1}(\underline{\theta}^{(n+1)}) \ge L_{n+1}(\underline{\theta}^{(n)})$  (3.30)

3.2  Sequential  EM  based  on  stochastic  approximation


The sequential algorithms presented so far were suggested by assuming that the underlying
problem had a special structure. In this section, we will address the general situation.
Unfortunately,  sequential  algorithms  may  not  be  derived  directly  from  the  EM  algorithm  in 
the  general  case.  We  will  therefore  suggest  algorithms,  that  approximate  the  EM  iteration, 
in  order  to  get  a  recursive  implementation.  We  will  be  able  to  show  that  these  algorithms 
belong  to  the  class  of  stochastic  approximation  algorithms,  for  which  a  general  theory  is 
readily  available. 

3.2.1  The  EM  algorithm:  General  sequential  considerations 

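The log-likelihood of the observations, given n + 1 data blocks, is given by (3.3). Define,

$L_{n+1/n}(\underline{\theta}) = \log f_{Y_{n+1}/Y_n,\ldots,Y_1}(\underline{y}_{n+1}/\underline{y}_n, \ldots, \underline{y}_1; \underline{\theta})$  (3.31)

The log-likelihood of the observations may be written recursively as,

$L_{n+1}(\underline{\theta}) = L_n(\underline{\theta}) + L_{n+1/n}(\underline{\theta})$  (3.32)

or as,

$L_{n+1}(\underline{\theta}) = L_1(\underline{\theta}) + \sum_{i=1}^{n} L_{i+1/i}(\underline{\theta})$  (3.33)

In order to develop a recursive algorithm, we refer to the recursive formula for the
log-likelihood (3.32). Analogous to (3.4), the term $L_n$ may be written as,

$L_n(\underline{\theta}) = Q_n(\underline{\theta};\underline{\theta}') - H_n(\underline{\theta};\underline{\theta}')$  (3.34)

where the complete data is defined to be $\underline{x}_1, \ldots, \underline{x}_n$. For the term $L_{n+1/n}$, the complete
data is $\underline{x}_{n+1}$ and, following the same considerations which lead to (3.34), we may write,

$L_{n+1/n}(\underline{\theta}) = Q_{n+1/n}(\underline{\theta};\underline{\theta}') - H_{n+1/n}(\underline{\theta};\underline{\theta}')$  (3.35)

where

$Q_{n+1/n}(\underline{\theta};\underline{\theta}') = E\{\log f_{X_{n+1}/Y_n,\ldots,Y_1}(\underline{x}_{n+1}/\underline{y}_n, \ldots, \underline{y}_1; \underline{\theta}) \mid \underline{y}_{n+1}, \ldots, \underline{y}_1; \underline{\theta}'\}$  (3.36)

$H_{n+1/n}(\underline{\theta};\underline{\theta}') = E\{\log f_{X_{n+1}/Y_{n+1},\ldots,Y_1}(\underline{x}_{n+1}/\underline{y}_{n+1}, \ldots, \underline{y}_1; \underline{\theta}) \mid \underline{y}_{n+1}, \ldots, \underline{y}_1; \underline{\theta}'\}$  (3.37)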

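Therefore,

$L_{n+1}(\underline{\theta}) = L_n(\underline{\theta}) + L_{n+1/n}(\underline{\theta}) = Q_n(\underline{\theta};\underline{\theta}') + Q_{n+1/n}(\underline{\theta};\underline{\theta}') - [H_n(\underline{\theta};\underline{\theta}') + H_{n+1/n}(\underline{\theta};\underline{\theta}')]$  (3.38)

and we have,

$H_n(\underline{\theta};\underline{\theta}') \le H_n(\underline{\theta}';\underline{\theta}')$  and  $H_{n+1/n}(\underline{\theta};\underline{\theta}') \le H_{n+1/n}(\underline{\theta}';\underline{\theta}')$  (3.39)

One could try to achieve a recursive algorithm by maximizing either,

$Q_n(\underline{\theta};\underline{\theta}^{(n)}) + Q_{n+1/n}(\underline{\theta};\underline{\theta}^{(n)})$  (3.40)

or,

$Q_1(\underline{\theta};\underline{\theta}^{(n)}) + Q_{2/1}(\underline{\theta};\underline{\theta}^{(n)}) + \cdots + Q_{n+1/n}(\underline{\theta};\underline{\theta}^{(n)})$  (3.41)

since maximizing either (3.40) or (3.41) will generate a new value $\underline{\theta}^{(n+1)}$ that increases the
likelihood $L_{n+1}(\underline{\theta})$. However, despite their seemingly recursive structure, these maximizations
cannot be performed sequentially in general, because:

• Calculating $Q_{n+1/n}$ involves the past data $\underline{y}_n, \ldots, \underline{y}_1$.

• For each new parameter value, the conditional expectations needed for the term $Q_n$,
or the terms $Q_1, Q_{2/1}, \ldots, Q_{n/n-1}$, should be recalculated. This requires using the
past data samples.

An approximate sequential algorithm

From the discussion above we conclude that a general sequential algorithm, that will implement
the desired maximizations of (3.40) or (3.41), cannot be specified. However, consider
the following sequential algorithm,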

•  Start,  n  =  0  :  Initialize  ♦o(£)  =  0.  Guess 

•  For  each  new  data  g  , 

-  E-step:  calculate 

Q^l/n(l l[n])  =  E{\og/x(i*+ui)\iln+vlU’--’  *in) }  (3-42) 

-  M-step:  solve 

=  arg  max  faUl/n(g,  t (n])  -  3n  •  *„(«)]  (3.43) 

-  Record  for  next  step 

*n+l(9)  =  Qa„Vn(k9[n))~  $*■*»(*) 


-  n  =  n  —  1 

This algorithm approximates the desired procedures as follows. First, the term $Q_{n+1/n}(\theta;\theta^{(n)})$
is approximated by $Q'_{n+1/n}(\theta;\theta^{(n)})$, given by (3.42). We will use in this approximation some
past data values, $y_n, \ldots, y_{n-m}$, as long as $Q'_{n+1/n}$ is calculated recursively. We note that, if
the different observation blocks are independent, $Q'_{n+1/n} = Q_{n+1/n}$. In general, the weaker
the correlation between successive observation blocks, the better this approximation becomes.
Second, the previous terms are not recalculated. We calculate each $Q'_{i+1/i}$ using the
corresponding parameter value, $\theta^{(i)}$, and we simply accumulate these functions and generate
$\Psi_n(\theta)$ recursively. Also, using this algorithm, the previous terms may be weighted, according
to the choice of $\beta_n$. By an appropriate choice, we may reduce the contribution of the past
data and track varying parameters in the adaptive situation, or we may weight the past
data more heavily, to guarantee convergence and consistency, for a sequential algorithm.
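As a rough illustration only, the E-step, M-step and record steps listed above can be sketched for a toy problem. The model is an assumption made here and does not come from the text: a two-component Gaussian mixture with equal weights and unit variances, where only the two means are unknown and the complete data is the sample together with its component label. For this model $\Psi_n(\theta)$ is quadratic in the means, so it can be stored as two running sums. The function name `sequential_em_means` and the parameter `prior_weight` are hypothetical; `prior_weight` replaces $\Psi_0 = 0$ by a weak quadratic term centred on the initial guess, since with $\Psi_0 = 0$ the first M-step of this toy model would set both means to the first sample.

```python
import numpy as np

def sequential_em_means(y, mu0, beta=1.0, prior_weight=1.0):
    """Sequential EM sketch for the means of a two-component Gaussian mixture.

    Assumptions (not from the text): equal mixing weights, unit variances,
    and Psi_0 replaced by a weak quadratic term centred on the initial guess.
    """
    mu = np.asarray(mu0, dtype=float).copy()
    # Psi_n is quadratic in the means, so two statistics per component suffice.
    weight_sum = np.full(2, prior_weight)   # accumulated responsibilities
    value_sum = prior_weight * mu           # accumulated responsibility * sample
    for sample in y:
        # E-step: posterior probability of each component given the new
        # sample and the current estimate (the expectation inside Q').
        log_post = -0.5 * (sample - mu) ** 2
        resp = np.exp(log_post - log_post.max())
        resp /= resp.sum()
        # Record: Psi_{n+1} = Q' + beta * Psi_n, expressed via its statistics.
        weight_sum = beta * weight_sum + resp
        value_sum = beta * value_sum + resp * sample
        # M-step: the maximizer of the accumulated quadratic is a weighted mean.
        mu = value_sum / weight_sum
    return mu

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=4000)
true_means = np.array([-2.0, 2.0])
samples = true_means[labels] + rng.standard_normal(4000)
estimate = sequential_em_means(samples, mu0=[-1.0, 1.0])
print(estimate)
```

Choosing `beta` below 1 discounts old samples, which corresponds to the adaptive use of the weights described above; `beta = 1` weights all samples equally.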

Although  it  seems  that  this  algorithm  is  based  on  ad-hoc  approximations,  we  will  be 
able  to  show  that  this  algorithm  belongs  to  the  class  of  stochastic  approximation  algorithms. 
Later  in  the  chapter,  we  will  use  results  developed  for  stochastic  approximation  algorithms 
to  calculate  the  asymptotic  distribution  of  the  estimator  and  prove  its  consistency.  However, 
before  that,  we  will  briefly  present  the  stochastic  approximation  idea. 

3.2.2  Stochastic  approximation 

In a typical problem for stochastic approximation, a sequence of random variables (vectors),
$\{y_n\}$, is observed. We assume that the sequence is stationary, in the sense that each $y_n$ has
the same marginal distribution, and that it is ergodic. At each instance, a function of the
observed data and the desired parameters, $L(\theta;y_n)$, is given. We want either to optimize
the unavailable ensemble average of L, i.e. to find

$$\max_{\theta} E_y\{L(\theta;y_n)\} = \max_{\theta} \bar{L}(\theta) \qquad (3.44)$$

or to solve an equation that involves this unavailable expected value, e.g.

$$E_y\{L(\theta;y_n)\} = 0 \quad \text{or} \quad \bar{L}(\theta) = 0 \qquad (3.45)$$

The first problem is sometimes referred to as the Kiefer-Wolfowitz (K-W) problem [40].
The second problem is referred to as the Robbins-Monro (R-M) problem [41]. By defining

$$V(\theta;y_n) = \frac{\partial}{\partial\theta} L(\theta;y_n) \qquad (3.46)$$

a K-W problem for L may be reduced to a R-M problem for V.

Suppose  we  have  an  iterative  algorithm  for  optimizing  L(9)  of  (3.44)  or  for  solving  the 
(non-linear)  equation  (3.45).  For  example,  the  gradient  iterative  algorithm  for  maximizing 
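(3.44) will be,

$$\theta^{(n+1)} = \theta^{(n)} + \beta \cdot \frac{\partial}{\partial\theta}\bar{L}(\theta^{(n)}) = \theta^{(n)} + \beta \cdot E_y\left\{\frac{\partial}{\partial\theta}L(\theta^{(n)};y)\right\} \qquad (3.47)$$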

We cannot implement this iterative algorithm, since the expected value of L and its
derivatives are not available. The stochastic approximation idea is to approximate the
expected value by the sample value. Since we have an ergodic sequence of realizations,
$\{y_n\}$, the next iteration is performed using the next realization. This achieves a time average
that approximates the unavailable ensemble average values.

Specifically, the stochastic approximation of the gradient algorithm, referred to as the
stochastic gradient algorithm, is given by,

$$\theta^{(n+1)} = \theta^{(n)} + \beta_n \cdot \frac{\partial}{\partial\theta} L(\theta^{(n)};y_{n+1}) \qquad (3.48)$$

In [42] and [43] it is shown that, if the observation sequence is ergodic and if $\{\beta_n\}$ is a
sequence of positive numbers such that $\lim_{n\to\infty}\beta_n = 0$ and it satisfies,

$$\sum_{n=0}^{\infty} \beta_n = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \beta_n^2 < \infty \qquad (3.49)$$

e.g. $\beta_n = \beta/n$, then the stochastic gradient algorithm converges, with probability 1 and in
the mean-square sense, to the right solution of (3.44).
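A minimal sketch of the stochastic gradient recursion with a decreasing gain may help fix ideas. The example is an assumption chosen for simplicity and is not taken from the text: $L(\theta;y) = -(y-\theta)^2/2$, whose ensemble average is maximized at the mean of y, with gain $\beta_n = \beta/n$ indexed from n = 1. The function name `stochastic_gradient_mean` is hypothetical. With $\beta = 1$ the recursion reduces to the running sample mean, which gives a simple check.

```python
import numpy as np

def stochastic_gradient_mean(y, theta0=0.0, beta=1.0):
    """Stochastic gradient ascent on E{L(theta; y)} with L = -(y - theta)^2 / 2."""
    theta = theta0
    for n, sample in enumerate(y, start=1):
        gain = beta / n                 # decreasing gain beta_n = beta / n
        gradient = sample - theta       # dL/dtheta evaluated at the new sample
        theta = theta + gain * gradient
    return theta

rng = np.random.default_rng(1)
observations = 3.0 + rng.standard_normal(5000)
theta_hat = stochastic_gradient_mean(observations)
print(theta_hat, observations.mean())
```

Each update uses only the newest sample and the previous estimate, so no past data has to be stored.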

We note that, if the observed data is not stationary and we are looking for an adaptive
algorithm, then we usually choose a constant gain $\beta$ in some range, instead of $\{\beta_n\}$ as in
(3.49). This way, we reduce the weight of past observations and use the new input to track
varying parameter values.

3.2.3 The EM stochastic approximation algorithm

The best statistical parameter estimation method, which we can hope to find, is a method
that solves the following optimization problem:

$$\hat{\theta} = \arg\max_{\theta} E_{y_{n+1}}\{\log f_{Y_{n+1}}(y_{n+1};\theta)\} = \arg\max_{\theta} J(\theta) \qquad (3.50)$$

This is because the solution to this problem is the true parameter value. To prove this
claim, we note that using Jensen's inequality, we get

$$J(\theta) = \int \log f_{Y_{n+1}}(y;\theta)\, f_{Y_{n+1}}(y;\theta_{true})\,dy \le J(\theta_{true}) \qquad (3.51)$$

i.e. $\theta_{true}$ maximizes $J(\theta)$. We note that the equality in (3.51) is achieved, if and only if

$$f_{Y_{n+1}}(y;\theta) = f_{Y_{n+1}}(y;\theta_{true}) \quad \text{almost everywhere}$$
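For completeness, the Jensen step behind this inequality can be written out as follows. This is a standard argument supplied here as a sketch; $f_Y$ abbreviates $f_{Y_{n+1}}$ and the integral is taken over the support of $f_Y(\cdot;\theta_{true})$.

```latex
\begin{aligned}
J(\theta) - J(\theta_{true})
  &= \int f_Y(y;\theta_{true}) \,\log\frac{f_Y(y;\theta)}{f_Y(y;\theta_{true})}\,dy \\
  &\le \log \int f_Y(y;\theta_{true}) \,\frac{f_Y(y;\theta)}{f_Y(y;\theta_{true})}\,dy
   = \log \int f_Y(y;\theta)\,dy \le \log 1 = 0 .
\end{aligned}
```

The first inequality is Jensen's inequality for the concave logarithm, and it is tight only when the ratio of the two densities is constant, which is the almost-everywhere condition stated above.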

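The maximization of $J(\theta)$ can be accomplished using the Newton-Raphson method or any
other optimization method. Instead, we will use an iterative algorithm for maximizing $J(\theta)$
based on the considerations leading to the EM algorithm, as follows.

Using $x_{n+1}$ as the complete data with respect to $y_{n+1}$ and following the method used
for deriving (2.12), we may write $J(\theta)$ as,

$$J(\theta) = \bar{Q}(\theta;\theta') - \bar{H}(\theta;\theta') \qquad (3.52)$$

where

$$\bar{Q}(\theta;\theta') = E_{y_{n+1}}\{E\{\log f_{X_{n+1}}(x_{n+1};\theta) \mid y_{n+1};\theta'\}\} \qquad (3.53)$$

$$\bar{H}(\theta;\theta') = E_{y_{n+1}}\{E\{\log f_{X_{n+1}/Y_{n+1}}(x_{n+1}/y_{n+1};\theta) \mid y_{n+1};\theta'\}\} \qquad (3.54)$$

Considering the function $\bar{H}(\theta;\theta')$, it is easy to show, using Jensen's inequality for the
expression inside the expectations, that,

$$\bar{H}(\theta;\theta') \le \bar{H}(\theta';\theta') \qquad (3.55)$$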


Analogous to the EM algorithm, an iterative algorithm for maximizing $J(\theta)$ is given by,

$$\theta^{(n+1)} = \arg\max_{\theta} \bar{Q}(\theta;\theta^{(n)}) = \arg\max_{\theta} E_{y_{n+1}}\{E\{\log f_{X_{n+1}}(x_{n+1};\theta) \mid y_{n+1};\theta^{(n)}\}\} \qquad (3.56)$$

where from (3.54) and (3.55), each iteration of (3.56) increases $J(\theta)$.

Unfortunately, it is impossible to implement (3.56), since the expected value with respect
to y is not available. Using the stochastic approximation idea, a stochastic realization
of (3.56) is performed as the (n+1)th data block is observed. Thus, we get the following
stochastic approximation algorithm

$$\theta^{(n+1)} = \arg\max_{\theta} E\{\log f_{X_{n+1}}(x_{n+1};\theta) \mid y_{n+1};\theta^{(n)}\} \qquad (3.57)$$

Following the notation used in the beginning of this section we will define

$$Q'_{n+1/n}(\theta;\theta^{(n)}) = E\{\log f_{X_{n+1}}(x_{n+1};\theta) \mid y_{n+1};\theta^{(n)}\} \qquad (3.58)$$

In a sequential algorithm, the new data block together with the past data blocks should
provide a time-average approximation to the ensemble average of (3.56). This may require
weighting the past data more heavily. Thus, we define recursively a function $\Psi(\theta)$ as,

$$\Psi_{n+1}(\theta) = Q'_{n+1/n}(\theta;\theta^{(n)}) + \beta_n \cdot \Psi_n(\theta) \qquad (3.59)$$

and the general stochastic EM step will be

$$\theta^{(n+1)} = \arg\max_{\theta} \Psi_{n+1}(\theta) \qquad (3.60)$$

which is the algorithm suggested in (3.43).

We note that this algorithm was also suggested in [44], during the investigation of
approximations to the stochastic Newton algorithm.

3.3 Some properties of the sequential EM algorithms


Analysis of sequential and adaptive procedures, especially in a statistical or stochastic
context, has been the subject of extensive research efforts. The properties of the stochastic
approximation method, being the simplest, may be found in various references, e.g.
[43,45,46,47] and elsewhere. This topic is probably one of the most difficult subjects in
mathematical statistics: investigating convergence of complicated stochastic structures, proving
convergence in probability, with probability 1 or in the mean-square sense, and finding the
rate of convergence requires using advanced probabilistic tools from Martingale theory and
stochastic calculus theory. Thus, a typical assumption made in most of the references above,
in order to simplify the analysis, is that the observed data blocks are independent.

The analysis of the sequential EM algorithms for the stationary case, presented below,
is far from complete. Nevertheless, the following results were achieved:

•  General  asymptotic  consistency:  We  will  show  that  the  estimator,  generated  by  a 
sequential  EM  algorithm,  is  asymptotically  consistent,  when  the  ML  estimator  is 
consistent  and  the  sequential  EM  iteration  converges  to  a  stationary  point. 

•  Limit  distribution:  The  limit  distribution  of  the  estimator  will  be  given  for  some 
sequential  EM  algorithms.  These  results  are  for  independent  observations,  however. 

The properties of the sequential EM algorithms should be investigated further. Detailed
analysis may require the use of more advanced mathematical tools. It is an interesting
research topic in mathematical statistics. The book by Kushner [47], together with the
EM ideas and the preliminary analysis of the sequential EM algorithms, presented in this
chapter, should provide the starting point for this research.

3.3.1 General asymptotic consistency

The analysis of sequential algorithms when the observations are not independent may, in
general, be quite complicated. However, some sequential EM algorithms have the property
that in the limit the convergence point is a stationary point of the corresponding likelihood
limit. Using this property we will prove the asymptotic consistency of these algorithms
as follows. The sequence of normalized log-likelihood functions at each instance, $\frac{1}{n}L_n(\theta)$,
is shown to converge with probability 1 to a limit, $l(\theta)$, whose unique maximum is the true
parameter value, $\theta_{true}$. Under regularity conditions, the derivative of the likelihood also
converges. Since the sequence of sequential EM estimates converges to a stationary point
of the likelihood, i.e. to a zero derivative point, it converges to a zero derivative point of $l(\theta)$
which is its maximum, i.e. the true parameter value.

Specifically, as discussed in Appendix B, for a class of ergodic sources, which includes,
for example, all finite Markov sources, the sequence $\frac{1}{n}L_n(\theta)$, where $L_n(\theta)$ is given in (3.3),
converges uniformly with probability 1 to,

$$l(\theta) = \int f_{Y_n,Y_{n-1},\ldots}(y_n,y_{n-1},\ldots;\theta_{true}) \log f_{Y_n/Y_{n-1},\ldots}(y_n/y_{n-1},\ldots;\theta)\, dy_n\, dy_{n-1} \cdots \qquad (3.61)$$

Intuitively, the sources that belong to this class are ergodic sources, whose memory fades
fast enough. This result is also discussed in [48].

The function $l(\theta)$ achieves its maximum at $\theta = \theta_{true}$. Under regularity conditions and
the convexity of (3.61), $\theta_{true}$ is the unique solution to the equation $Dl(\theta) = 0$. Now, using
this fact and some well known results from analysis, the following theorem may be proved.

Theorem 3.1 Let the observations $y_1, \ldots, y_n, \ldots$ be generated by an ergodic source for which
(3.61) holds. Let $\theta^{(n)}$ be an instance of a sequential EM algorithm such that, for any
realization of the observations,

(i) the sequence of estimates $\{\theta^{(n)}\}$ converges to a limit $\theta'$

(ii) $\lim_{n\to\infty} D^{10}Q_{n+1}(\theta^{(n+1)};\theta^{(n)}) = 0$

Then, with probability 1, as $n \to \infty$, $\theta^{(n)} \to \theta_{true}$.

The proof of this theorem is also given in Appendix B.

The conditions of Theorem 3.1 may be verified for many specific sequential EM
algorithms. For example, the sequential EM algorithm with exact EM mapping satisfies the
condition (ii) above, for all n; thus, if the observation sequence is ergodic and satisfies
(3.61), whenever the algorithm converges, it generates a consistent estimator.

3.3.2  Limit  distribution 

The asymptotic distribution of several sequential EM algorithms can be calculated using
the following technique. The recursion defined by these algorithms will be approximated by
a recursion that resembles stochastic approximation algorithms, especially the stochastic
Newton method. Having this similarity, we will be able to invoke results developed in the
stochastic approximation context and show, in some cases, that in the limit the estimator
is distributed Normally around the true parameter value and has $\sqrt{n}$ consistency, i.e. its
standard deviation tends to zero as $1/\sqrt{n}$. We note that the possible connection between the
sequential EM algorithm and the stochastic Newton method was pointed out in [44].

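Consider, for example, the stochastic EM algorithm defined by the recursion (3.60),
repeated here,

$$\theta^{(n+1)} = \arg\max_{\theta} \Psi_{n+1}(\theta)$$

where $\Psi_{n+1}$ is defined recursively by

$$\Psi_{n+1}(\theta) = Q'_{n+1/n}(\theta;\theta^{(n)}) + \beta_n \cdot \Psi_n(\theta)$$

and

$$Q'_{n+1/n}(\theta;\theta^{(n)}) = E\{\log f_{X_{n+1}}(x_{n+1};\theta) \mid y_{n+1};\theta^{(n)}\}$$

Note that by the construction of the stochastic EM algorithm, $\theta^{(n)}$ maximizes $\Psi_n(\theta)$.
We will assume regularity conditions that allow certain operations, e.g. differentiation
under the integral sign.

An approximation to the stochastic EM recursion may be achieved if, instead of solving
the maximization problem of (3.60), a Newton-Raphson step, starting in $\theta^{(n)}$, is performed.
The resulting recursion is

$$\theta^{(n+1)} = \theta^{(n)} - [D^2\Psi_{n+1}(\theta^{(n)})]^{-1} D\Psi_{n+1}(\theta^{(n)}) \qquad (3.62)$$

The gradient vector, $D\Psi_{n+1}$, and the Hessian matrix, $D^2\Psi_{n+1}$, are also given recursively,

$$D\Psi_{n+1}(\theta^{(n)}) = D^{10}Q'_{n+1/n}(\theta^{(n)};\theta^{(n)}) + \beta_n \cdot D\Psi_n(\theta^{(n)}) \qquad (3.63)$$

$$D^2\Psi_{n+1}(\theta^{(n)}) = D^{20}Q'_{n+1/n}(\theta^{(n)};\theta^{(n)}) + \beta_n \cdot D^2\Psi_n(\theta^{(n)}) \qquad (3.64)$$

However, $D\Psi_n(\theta^{(n)}) = 0$ since $\theta^{(n)}$ maximizes $\Psi_n$. Also, from (2.36),

$$D^{10}Q'_{n+1/n}(\theta^{(n)};\theta^{(n)}) = D\log f_Y(y_{n+1};\theta^{(n)}) = S(y_{n+1};\theta^{(n)})$$

For exponential families, the second derivative of Q is such that $-D^2Q$ is the Fisher
information matrix of the complete data, $I_X$, calculated at $\theta^{(n)}$. Thus

$$D^2\Psi_{n+1}(\theta^{(n)}) = -I_X(\theta^{(n)}) + \beta_n \cdot D^2\Psi_n(\theta^{(n)}) \qquad (3.65)$$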

and so on. If $\beta_n = 1$ then $D^2\Psi_{n+1}(\theta^{(n)}) \approx -(n+1)\,I_X(\theta^{(n)})$. In this case, from (3.62), the
stochastic EM recursion is approximated by

$$\theta^{(n+1)} = \theta^{(n)} + \frac{1}{n+1}\, I_X^{-1}(\theta^{(n)}) \cdot S(y_{n+1};\theta^{(n)}) \qquad (3.66)$$

Recursions like (3.66) are typical in stochastic approximation algorithms. For example,
replacing $I_X(\theta^{(n)})$ by $I_Y(\theta^{(n)}) = -D^2\bar{L}(\theta^{(n)})$ in (3.66) yields the stochastic Newton method.
It has been shown, e.g. in [49] and [46], that the recursion (3.66) generates an estimator,
$\theta^{(n)}$, which satisfies, under regularity conditions and provided that $2I_Y(\theta_{true}) > I_X(\theta_{true})$,

$$\sqrt{n}\,(\theta^{(n)} - \theta_{true}) \to N\left(0,\; I_X^{-2} I_Y (2 I_Y I_X^{-1} - I)^{-1}\right) \qquad (3.67)$$

in distribution, as $n \to \infty$. The Fisher information matrices $I_X$, $I_Y$ in (3.67) are evaluated
at $\theta_{true}$.
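The limiting variance can be checked numerically in a scalar case. The sketch below assumes that (3.67) reads $N(0, I_X^{-2} I_Y (2 I_Y I_X^{-1} - I)^{-1})$ and uses a hypothetical model that is not from the text: $y = x_1 + x_2$ with $x_1 \sim N(\theta, 1)$ and $x_2 \sim N(0, 0.25)$, so that $I_X = 1$ and $I_Y = 0.8$. For this linear Gaussian model the score is $S(y;\theta) = I_Y (y - \theta)$, the error of the recursion (3.66) obeys a linear recursion, and its variance can be propagated exactly instead of simulated. The function name `scaled_error_variance` is an illustrative choice.

```python
def scaled_error_variance(info_x, info_y, steps, initial_variance=1.0):
    """Exact n * Var(theta_n - theta_true) for the scalar linear recursion."""
    gain = info_y / info_x          # I_X^{-1} times the slope of the score
    noise_var = 1.0 / info_y        # variance of y in the Gaussian model
    var = initial_variance
    for n in range(1, steps + 1):
        step = gain / n
        # error_{n} = (1 - step) * error_{n-1} + step * noise
        var = (1.0 - step) ** 2 * var + step ** 2 * noise_var
    return steps * var

info_x, info_y = 1.0, 0.8
predicted = info_y / info_x ** 2 / (2.0 * info_y / info_x - 1.0)
simulated = scaled_error_variance(info_x, info_y, steps=200_000)
print(predicted, simulated, 1.0 / info_y)
```

In this example the predicted value exceeds $1/I_Y$, the bound attained by the stochastic Newton method, which illustrates the price of using $I_X$ in place of $I_Y$ in the gain.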

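When $2I_Y(\theta_{true}) < I_X(\theta_{true})$, (3.67) above does not hold (although the stochastic EM
algorithm may still yield a consistent estimator). However, if we choose the coefficients $\beta_n$
in such a way that the stochastic EM algorithm is approximated by

$$\theta^{(n+1)} = \theta^{(n)} + \frac{1}{\alpha n}\, I_X^{-1}(\theta^{(n)}) \cdot S(y_{n+1};\theta^{(n)}) \qquad (3.68)$$

and $0 < \alpha < 2I_Y(\theta_{true})\,I_X^{-1}(\theta_{true}) < 1$, then according to [46]

$$n^{1/2}(\theta^{(n)} - \theta_{true}) \to N\left(0,\; \tfrac{1}{\alpha} I_X^{-2} I_Y (2 I_Y I_X^{-1} - \alpha)^{-1}\right) \qquad (3.69)$$

in distribution, as $n \to \infty$, and the asymptotic Normality and $\sqrt{n}$ consistency hold.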

A similar derivation using a Newton-Raphson approximation can be performed for the
sequential EM algorithm with exact EM mapping. This algorithm generates estimates
according to the mapping (3.7), that is

$$\theta^{(n+1)} = \arg\max_{\theta} Q_{n+1}(\theta;\theta^{(n)};y_{n+1},\ldots,y_1)$$

which may be approximated, using a step of the Newton-Raphson algorithm, by

$$\theta^{(n+1)} = \theta^{(n)} - [D^2Q_{n+1}(\theta^{(n)};\theta^{(n)};y_{n+1},\ldots)]^{-1} DQ_{n+1}(\theta^{(n)};\theta^{(n)};y_{n+1},\ldots) \qquad (3.70)$$

Using the fact that $DQ_n(\theta^{(n)};\theta^{(n)};y_n,\ldots) = 0$ we may write

$$DQ_{n+1}(\theta^{(n)};\theta^{(n)};y_{n+1},\ldots) = DL_{n+1/n}(\theta^{(n)}) = S(y_{n+1}/y_n,\ldots;\theta^{(n)}) \qquad (3.71)$$

For exponential families, the second derivative of Q will provide the Fisher information, i.e.

$$D^2Q_{n+1}(\theta^{(n)};\theta^{(n)};y_{n+1},\ldots) = -I_{X_{n+1},\ldots,X_1}(\theta^{(n)}) \qquad (3.72)$$

Thus the approximation (3.70) may be written as

$$\theta^{(n+1)} = \theta^{(n)} + I^{-1}_{X_{n+1},\ldots,X_1}(\theta^{(n)}) \cdot S(y_{n+1}/y_n,\ldots;\theta^{(n)}) \qquad (3.73)$$

For the case where the observations are coming from a finite Markov source, the Fisher
information is written recursively as a sum of identical conditional information matrices. We
may again use the results of Sacks [49] and Fabian [46] to get Normality and $\sqrt{n}$ consistency
of the estimator, as in (3.67), where the conditional information matrices $I_{X_{n+1}/X_n}$ and
$I_{Y_{n+1}/Y_n}$ replace the matrices $I_X$ and $I_Y$.

Chapter 4


Parameter  Estimation  of 
Superimposed  Signals 


In  this  chapter,  we  will  consider  the  problem  of  estimating  parameters  of  superimposed 
signals  observed  in  noise,  which  occurs  in  a  wide  range  of  signal  processing  applications. 
This  problem  will  be  approached  in  this  chapter  using  the  iterative  EM  method,  presented 
in  the  previous  chapters.  In  the  next  chapter,  we  will  apply  the  proposed  iterative  method 
to  another  important  signal  processing  problem,  namely,  the  multiple  microphone  noise 
cancellation  problem. 

A specific example of the applications that are considered in this chapter is the multiple
source location estimation problem, using an array of sensors. In this problem, we have K
sources radiating signals towards an array of M sensors, as illustrated in Figure 4.1. The
location of the sensors is known, and we want to use the relative time delay between the
observed signals from the different sensors to estimate the location of the sources. The
signals, received by the array sensors due to the kth source, may be represented as the
vector signal, $s_k(t)$, where the mth component of this vector is the signal received by the
mth sensor. The vector signal, $s_k(t)$, is dependent upon a vector of unknown parameters,
$\theta_k$, associated with the kth source and is denoted $s_k(t;\theta_k)$. In our problem, $\theta_k$ is the vector
of unknown source location parameters. We will denote by the vector $y(t)$ the total signals
observed by the array sensors. This observed signal vector is a result of superimposing the
various $s_k(t;\theta_k)$ and an additive noise vector, i.e.

$$y(t) = \sum_{k=1}^{K} s_k(t;\theta_k) + n(t) \qquad (4.1)$$

Our problem is to estimate the location parameters, given the observations, $y(t)$.

Figure 4.1: Array-Source Geometry

The general problem of interest in this chapter is characterized by the model (4.1).
The basic structure of (4.1) applies to a wide range of signal processing problems,
in addition to the multiple source location estimation problem. Consider, for example,
the problem of multi-echo time delay estimation. In this case each signal component is
the scalar $s_k(t;\theta_k)$, representing the kth echo signal, and the parameters $\theta_k$ are the time
delay and attenuation of the kth echo. Another example is frequency estimation of multiple
sinusoids in noise, where the unknown parameters, $\theta_k$, are the amplitude and the frequency
of the kth sinusoid.

Using the mathematical model of (4.1) together with a stochastic model for the various
signals, one can formulate a statistical maximum likelihood problem for the underlying real
problem. In many cases, the direct solution of this ML problem is complicated. Using the
EM algorithm, we will develop in this chapter computationally efficient schemes for the
joint estimation of $\theta_1, \theta_2, \ldots, \theta_K$. The idea is to decompose the observed signal, $y(t)$, into
its components, and then to estimate the parameters of each signal component separately.
Stating this idea in the terminology of the EM algorithm, we choose the complete data
to be the contribution of each signal component separately. Thus the algorithm iterates
between decomposing the observed data, i.e. estimating the complete data using the
current parameter estimates (the E step), and updating the parameter estimates, having the
decomposed signals (the M step).

So far the superimposed signals problem has been stated in its most general form.
In different applications, additional specific modeling assumptions are needed. In a large
variety of problems, we may assume that the noise signals, $n(t)$, are sample functions
from a stationary Gaussian process with a given spectrum. However, the modeling of the
signal components in (4.1) varies according to our a-priori knowledge and the nature of the
underlying real problem. Generally, these signals may be deterministic or stochastic, and
various constraints may be applied on their waveforms or on their power spectra.

The problem of parameter estimation of superimposed signals in noise and its solution
via the EM algorithm is presented in this chapter, in a variety of situations, as follows. In
section 4.1, we will present the statistical ML problem and its EM solution procedures in the
deterministic signals case. In section 4.2, a similar presentation is given for the stochastic
(Gaussian) signals case. These procedures are then used, in sections 4.3 and 4.4, to solve the
multipath time delay estimation problem and the passive multiple source location estimation
problem. We will conclude this chapter, in section 4.5, by presenting sequential and adaptive
algorithms, and applying these algorithms to the problem of estimating the frequencies of
multiple sinusoids in noise.


4.1 Parameter estimation of superimposed signals: The deterministic case

The signal components, $s_k(t;\theta_k)$, are naturally modeled as deterministic signals in a
variety of applications. Consider for example an active radar or sonar environment, where
a known waveform pulse is transmitted. We observe the echoes of this pulse returning
from several targets. Assuming perfect propagation conditions, the observed signal is a
result of superimposing deterministic signals (the pulse echoes), which are known up to
some parameters (e.g. the time delay). A statistical problem for estimating the unknown
parameters of the superimposed signals is achieved in this case, when a stochastic model
for the noise components of (4.1) is assumed.

In this section, we will present this statistical maximum likelihood problem and show
that its direct solution is complicated, even in the simplest case, when we assume that the
noise is white. Thus, we will develop methods based on the EM algorithm to solve this
problem.

4.1.1 The ML problem


Consider  the  model  of  (4.1)  under  the  following  assumptions: 

• The signal vectors, $s_k(t;\theta_k)$, $k = 1, \ldots, K$, are conditionally known up to a vector
of parameters, $\theta_k$.

• The $n(t)$ are vector zero-mean white Gaussian processes whose covariance matrix is

$$E\{n(t)\,n^*(\sigma)\} = Q \cdot \delta(t - \sigma)$$

where Q is a positive definite constant matrix and $\delta(\cdot)$ is the impulse function.

• The signals are observed over a finite duration, say $T_i \le t \le T_f$.

Under these assumptions, the log-likelihood function is given by,

$$\log f_Y(y;\theta) = C - \frac{\lambda}{2}\int_{T_i}^{T_f}\Big[y(t) - \sum_{k=1}^{K}s_k(t;\theta_k)\Big]^* Q^{-1}\Big[y(t) - \sum_{k=1}^{K}s_k(t;\theta_k)\Big]dt \qquad (4.2)$$

where * denotes the conjugate transpose operator, $\lambda = 1$ if $n(t)$ is real valued, $\lambda = 2$ if
$n(t)$ is complex valued, and C is a normalization constant independent of $\theta$. This result is just a
straightforward multi-channel extension of the known (deterministic) signal in white noise
problem ([1], chap. 4). If the observed signal is discrete, i.e. we observe $y(t_i)$, $i = 1, \ldots, N$,
the log-likelihood is still given by (4.2), where the integral over t is replaced by the sum
over the $t_i$'s.


Thus, the joint ML estimation of the $\theta_k$'s is obtained by solving

$$\min_{\theta}\int_{T_i}^{T_f}\Big[y(t) - \sum_{k=1}^{K}s_k(t;\theta_k)\Big]^* Q^{-1}\Big[y(t) - \sum_{k=1}^{K}s_k(t;\theta_k)\Big]dt \qquad (4.3)$$

Or, for discrete observations,

$$\min_{\theta}\sum_{i=1}^{N}\Big[y(t_i) - \sum_{k=1}^{K}s_k(t_i;\theta_k)\Big]^* Q^{-1}\Big[y(t_i) - \sum_{k=1}^{K}s_k(t_i;\theta_k)\Big] \qquad (4.4)$$

In either case, we have a complicated multi-parameter optimization problem. Of course,
standard techniques, such as Gauss-Newton or some other iterative gradient search
algorithm, can always be used to solve the problem. However, when applied to the problem at
hand, these methods tend to be computationally complex and time consuming.

We note that the more general problem, where the noise vector, $n(t)$, has an arbitrarily
given power spectrum matrix, $N(\omega)$, may be reduced to the problem presented above,
where the noise vector is white, by using an appropriate whitening filter. Let $\tilde{s}_k(t;\theta_k)$ be
the output of the whitening filter to the input $s_k(t;\theta_k)$. In this case the likelihood of the
observations is still given by (4.2), where we use $\tilde{s}_k(t;\theta_k)$ instead of $s_k(t;\theta_k)$.


4.1.2  Solution  using  the  EM  method 


Having in mind the EM algorithm and the class of iterative algorithms, developed in
the first part of the thesis, we want to simplify the optimization problem associated with
the direct ML approach.

In order to apply an EM algorithm to the problem at hand, we need to specify the
complete data. A natural choice of a complete data, $x(t)$, is obtained by decomposing $y(t)$
into its signal components, that is

$$x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_K(t) \end{bmatrix} \qquad (4.5)$$

where

$$x_k(t) = s_k(t;\theta_k) + n_k(t) \qquad (4.6)$$


and the $n_k(t)$ are obtained by arbitrarily decomposing the total noise signal $n(t)$ into K
components, so that

$$\sum_{k=1}^{K} n_k(t) = n(t) \qquad (4.7)$$

From (4.1), (4.6) and (4.7) the relation between the complete data $x(t)$ and the incomplete
data $y(t)$ is given by

$$y(t) = \sum_{k=1}^{K} x_k(t) = H \cdot x(t) \qquad (4.8)$$

where

$$H = \underbrace{[\, I \;\; I \;\; \cdots \;\; I \,]}_{K\ \text{terms}}$$

We will find it most convenient to choose the $n_k(t)$ to be statistically independent, zero-mean
and Gaussian with a covariance matrix

and it is the mean of $x(t)$. The matrix $\Lambda$ is the covariance matrix of $x(t)$,

$$\Lambda = \begin{bmatrix} Q_1 & & \\ & \ddots & \\ & & Q_K \end{bmatrix} \qquad (4.12)$$


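The notation in (4.12) indicates that $\Lambda$ is a block-diagonal matrix. Thus, the log-likelihood
of the complete data may be written as

$$\log f_X(x;\theta) = C - \frac{\lambda}{2}\sum_{k=1}^{K}\int_{T_i}^{T_f}[x_k(t) - s_k(t;\theta_k)]^* Q_k^{-1}[x_k(t) - s_k(t;\theta_k)]\,dt \qquad (4.13)$$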

From this expression, we notice that, if the complete data was available, then maximizing
its likelihood with respect to $\theta$ is equivalent to the minimization of each of the terms in the
sum above separately, which is simpler than solving a multi-variable optimization problem
with respect to all $\theta_k$'s at once. We also notice that the sufficient statistics of the complete
data contain only linear terms, since the quadratic terms in (4.10) are independent of the
unknown parameters.

Of course, we do not observe the complete data. However, we take advantage of the
special structure of the likelihood of the complete data by using an EM algorithm with this
specification of complete data. This EM algorithm will iterate between estimating $x(t)$ and
using the estimated value in (4.13) to update the parameters by a separate optimization
with respect to each $\theta_k$.

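More specifically, from (2.19), an EM iteration is summarized by

$$\theta^{(n+1)} = \arg\max_{\theta} Q(\theta;\theta^{(n)}) = \arg\max_{\theta} E\{\log f_X(x;\theta) \mid y;\theta^{(n)}\} \qquad (4.14)$$

However, in this case

$$\arg\max_{\theta} Q(\theta;\theta^{(n)}) = \arg\max_{\theta} \log f_X(\hat{x}^{(n)};\theta) \qquad (4.15)$$

where

$$\hat{x}^{(n)}(t) = E\{x(t)/y;\theta^{(n)}\} = m(t;\theta^{(n)}) + \Lambda H^*[H\Lambda H^*]^{-1}[y(t) - H \cdot m(t;\theta^{(n)})] \qquad (4.16)$$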


Substituting (4.11) and (4.12) in (4.15) and following straightforward matrix
manipulations, we see that the maximization of (4.15) is equivalent to the minimization of the
sum,

$$\min_{\theta}\sum_{k=1}^{K}\int_{T_i}^{T_f}[\hat{x}_k^{(n)}(t) - s_k(t;\theta_k)]^* Q_k^{-1}[\hat{x}_k^{(n)}(t) - s_k(t;\theta_k)]\,dt \qquad (4.17)$$

where $\hat{x}_k^{(n)}$ is the kth component of $\hat{x}^{(n)}$. This minimization of the sum is equivalent to the
minimization of each of its components separately, with respect to $\theta_k$.

Also, substituting (4.12) in (4.15), the gain matrix becomes

$$\Lambda H^*[H\Lambda H^*]^{-1} = \mathrm{diag}(\beta_1, \beta_2, \ldots, \beta_K) \qquad (4.18)$$

where diag(···) indicates a diagonal matrix.

Summarizing  all  these  relations,  we  may  now  write  the  E  and  M  steps  of  the  EM 
algorithm  for  this  problem  as  follows: 


• The E step: For $k = 1, 2, \ldots, K$ compute

$$\hat{x}_k^{(n)}(t) = s_k(t;\theta_k^{(n)}) + \beta_k\Big[y(t) - \sum_{l=1}^{K}s_l(t;\theta_l^{(n)})\Big] \qquad (4.19)$$

where the $\beta_k$'s are any real-valued positive scalars satisfying

$$\sum_{k=1}^{K}\beta_k = 1$$

• The M step: For $k = 1, 2, \ldots, K$

$$\theta_k^{(n+1)} = \arg\min_{\theta_k}\int_{T_i}^{T_f}[\hat{x}_k^{(n)}(t) - s_k(t;\theta_k)]^* Q^{-1}[\hat{x}_k^{(n)}(t) - s_k(t;\theta_k)]\,dt \qquad (4.20)$$

We observe that $\theta_k^{(n+1)}$ is in fact the ML estimate of $\theta_k$ based on $\hat{x}_k^{(n)}$. This algorithm
is illustrated schematically in Figure 4.2. We note that in the case of discrete observations,
the integral of (4.20) is replaced by the sum over the points $\{t_i\}$.

Figure 4.2: The EM algorithm for Deterministic (known) signals
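The E and M steps above can be sketched for a small discrete example. Everything specific in this sketch is an assumption made for illustration and is not from the text: scalar real observations with Q = 1, K = 2 components that are delayed copies of a known Gaussian pulse with unit amplitude, the delays as the only unknowns, and a grid search in place of a proper single-component ML processor. The names `pulse` and `em_delays` are hypothetical.

```python
import numpy as np

def pulse(t, delay, width=5.0):
    """Known waveform used for every component (a hypothetical Gaussian pulse)."""
    return np.exp(-0.5 * ((t - delay) / width) ** 2)

def em_delays(y, t, delays0, betas, grid, iterations=30):
    """EM sketch for the delays of superimposed known pulses in white noise."""
    delays = np.array(delays0, dtype=float)
    candidates = np.array([pulse(t, d) for d in grid])   # one row per grid delay
    for _ in range(iterations):
        # All components are formed from the current estimates before any update.
        components = np.array([pulse(t, d) for d in delays])
        residual = y - components.sum(axis=0)
        for k in range(len(delays)):
            # E step: current component plus its share beta_k of the residual.
            x_k = components[k] + betas[k] * residual
            # M step: single-component least squares, here by grid search.
            errors = ((x_k - candidates) ** 2).sum(axis=1)
            delays[k] = grid[np.argmin(errors)]
    return delays

rng = np.random.default_rng(2)
t = np.arange(200.0)
true_delays = np.array([60.0, 90.0])
y = pulse(t, 60.0) + pulse(t, 90.0) + 0.1 * rng.standard_normal(t.size)
grid = np.arange(20.0, 180.0, 0.25)
estimated = em_delays(y, t, delays0=[50.0, 100.0], betas=[0.5, 0.5], grid=grid)
print(estimated)
```

Each pass over k only ever fits one pulse to one decomposed signal, so the search is one-dimensional regardless of K. The grid resolution limits the final accuracy in this sketch.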

The most striking feature of this algorithm is that it decouples the complicated
multi-parameter optimization into K separate ML optimizations. Hence, the complexity of the
algorithm is essentially unaffected by the assumed number of signal components. As K
increases, we have to increase the number of ML processors in parallel; however, each ML
processor is maximized separately. Since the algorithm is based on the EM method, each
iteration cycle increases the likelihood until convergence is accomplished.

Unknown signal waveforms


The  algorithm  developed  above  assumes  that  the  signal  waveforms.  ik{t).  are  known  a- 
prion  In  practice,  one  is  unliksly  to  have  a  detailed  prior  knowledge  of  these  waveforms 
in  which  case  they  must  be  estimated  jointly  with  the  parameters.  We  will  consider 
the  samples  of  the  unknown  waveforms  as  additional  parameters;  using  the  same  statistical 
formulation  as  above,  we  will  get  an  ML  problem  for  estimating  the  waveforms.  Following 
the  same  considerations  as  above,  we  can  specify  the  E  and  M  steps  of  an  EM  algorithm 
for  estimating  the  waveforms  and  the  parameters,  as. 


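• The E step: For k = 1, 2, ..., K compute

$$ x_k^{(n)}(t) = s_k^{(n)}(t;\theta_k^{(n)}) + \beta_k \left[ y(t) - \sum_{l=1}^{K} s_l^{(n)}(t;\theta_l^{(n)}) \right] \qquad (4.21) $$

• The M step: Minimize

$$ \sum_{k=1}^{K} \int_{T_i}^{T_f} \left[ x_k^{(n)}(t) - s_k(t;\theta_k) \right]^{\dagger} Q_k^{-1} \left[ x_k^{(n)}(t) - s_k(t;\theta_k) \right] dt \qquad (4.22) $$

with respect to $\theta_1, \cdots, \theta_K$ and $s_1(t), \cdots, s_K(t)$.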


The E step is identical to (4.19), where instead of using the a-priori given waveforms, we
use the current estimated waveforms, $s_k^{(n)}(t)$. The M step requires a more complicated
maximization. However, we will be able to give an explicit example for this M step later in
the chapter.

We note that the ML problem for estimating the waveforms in addition to the unknown
parameters is ill-posed, since there may be too many unknowns. To make the problem
well-posed, we have to incorporate some constraints on the possible signal waveforms. However,
we have to make sure that these constraints will correspond to the real, physical situation.

4.1.3 The extended EM algorithm and the choice of the $\beta_k$'s

The EM algorithm presented above corresponds to a family of complete data definitions.
Each choice of the $\beta_k$'s implies a different choice of complete data, $x(t)$: the probability
sample space and the corresponding pdf of $x(t)$ depend on the choice of the $\beta_k$'s. The
convenient feature of this family of complete data definitions is that each member of the
family satisfies the same relation between complete and incomplete data, given by (4.8).
This feature allowed the presentation of the EM algorithm for the entire family at once.

This family of complete data definitions may be further extended, while keeping the
simple structure of the algorithm steps (4.19) and (4.20), in the following way. We could
model the complete data, $x(t)$, as a Gaussian process, whose mean is $s(t;\theta)$ as in (4.11),
but whose variance is time dependent and given by,

$$ \Lambda(t) = \begin{bmatrix} Q_1(t) & & & \\ & Q_2(t) & & \\ & & \ddots & \\ & & & Q_K(t) \end{bmatrix} = \begin{bmatrix} \beta_1(t) Q & & & \\ & \beta_2(t) Q & & \\ & & \ddots & \\ & & & \beta_K(t) Q \end{bmatrix} \qquad (4.23) $$

where the $\beta_k(t)$ are arbitrary real values, satisfying for all t

$$ \sum_{k=1}^{K} \beta_k(t) = 1, \qquad \beta_k(t) \ge 0, \qquad T_i \le t \le T_f \qquad (4.24) $$

Any member of this extended family of complete data definitions corresponds to
decomposing the observation noise, $n(t)$, into statistically independent, zero-mean, Gaussian,
non-stationary components, $n_k(t)$, whose covariance matrix is

$$ E\{ n_k(t)\, n_k^{\dagger}(\sigma) \} = Q_k(t)\, \delta(t - \sigma) = \beta_k(t)\, Q\, \delta(t - \sigma) $$

The E step of the EM algorithm for each member of this extended family is similar to
(4.19), except that the $\beta_k(t)$'s replace the time-invariant $\beta_k$'s. The M step is similar to (4.20),
where again the time-varying $Q_k(t)$'s replace the time-invariant matrices $Q_k$'s.

From the discussion above we see that the suggested class of EM algorithms has many
degrees of freedom. We may look for a choice of $\beta_k(t)$'s that gives the simplest and fastest
algorithm. Furthermore, following the discussion in section 2.4, instead of fixing this choice,
the $\beta_k(t)$'s may vary from iteration to iteration according to some a-priori rule or depending
on the current parameter values, $\theta^{(n)}$. The EM algorithm, where the complete data
definition varies from iteration to iteration, has been referred to as the extended EM (EEM)
algorithm. Examples for applying extended EM algorithms to our problem are now given.

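Suppose that in some iteration one of the $\beta_k$'s, say $\beta_l$, is chosen to be unity; the remaining
$\beta_k$'s must be zero. In the next iteration, we will choose $\beta_{l+1}$ to be unity and so on (in a
cyclic way so that after $\beta_K$ is unity, $\beta_1$ will be unity). Substituting these $\beta_k$'s in (4.19)
and (4.20), we notice that the resulting algorithm is equivalent to a coordinate search or
alternate maximization algorithm of (4.2). In each iteration, $\theta_k^{(n+1)} = \theta_k^{(n)}$ for all k's that
correspond to a zero $\beta_k$, and $\theta_l$ is updated by,

$$ \theta_l^{(n+1)} = \arg\min_{\theta_l} \int_{T_i}^{T_f} \left[ y(t) - \sum_{k \ne l} s_k(t;\theta_k^{(n)}) - s_l(t;\theta_l) \right]^{\dagger} Q^{-1} \left[ y(t) - \sum_{k \ne l} s_k(t;\theta_k^{(n)}) - s_l(t;\theta_l) \right] dt \qquad (4.25) $$

where $l$ corresponds to the unity $\beta_l$.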
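
The equivalence with a coordinate search can be checked in one line. The following is a sketch, assuming that the E step has the form of (4.19) with $\beta_l = 1$ and $\beta_k = 0$ for $k \ne l$:

```latex
x_l^{(n)}(t) = s_l(t;\theta_l^{(n)}) + 1 \cdot \Big[ y(t) - \sum_{k=1}^{K} s_k(t;\theta_k^{(n)}) \Big]
             = y(t) - \sum_{k \ne l} s_k(t;\theta_k^{(n)})
```

The M step for component $l$ therefore fits $s_l(t;\theta_l)$ to the observation from which the current estimates of all the other components have been subtracted, which is exactly one step of a coordinate search over $\theta_l$.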

While  in  the  previous  example,  we  have  shown  how,  by  varying  the  complete  data, 
the  EM  algorithm  has  been  reduced  to  a  simple  (but  not  necessarily  efficient)  algorithm, 
we  will  now  show  how  an  algorithm  with  a  superlinear  convergence  rate  may  be  achieved. 
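To simplify the exposition, we will discuss a degenerate scalar case, where the unknown
parameters are given by the scalar $\theta$.

From (2.65), in order to achieve a superlinear convergence rate, we have to choose, in
each iteration, $\beta_k$'s (or complete data) that are the solution to the equation,

$$ D^{11} Q(\theta^{(n)}, \theta^{(n)}) = \left. \frac{\partial^2 Q(\theta_1, \theta_2)}{\partial \theta_1\, \partial \theta_2} \right|_{\theta_1 = \theta^{(n)},\ \theta_2 = \theta^{(n)}} = 0 \qquad (4.26) $$

Following (4.15), (4.16) and (4.17), the expression for $Q(\theta_1, \theta_2)$ in this case is given by,

$$ Q(\theta_1, \theta_2) = c - \sum_{k=1}^{K} \int_{T_i}^{T_f} Q_k^{-1}(t) \left[ x_k^{(n)}(t;\theta_2) - s_k(t;\theta_1) \right]^2 dt \qquad (4.27) $$

where $x_k^{(n)}(t;\theta_2)$ is given by,

$$ x_k^{(n)}(t;\theta_2) = s_k(t;\theta_2) + \beta_k(t) \left[ y(t) - \sum_{l=1}^{K} s_l(t;\theta_2) \right] \qquad (4.28) $$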

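Thus, a possible solution of (4.26) is to choose

$$ \beta_k(t) = \frac{ \left. \dfrac{\partial}{\partial\theta} s_k(t;\theta) \right|_{\theta = \theta^{(n)}} }{ \sum_{l=1}^{K} \left. \dfrac{\partial}{\partial\theta} s_l(t;\theta) \right|_{\theta = \theta^{(n)}} } \qquad (4.29) $$

If this choice of $\beta_k(t)$ is allowed, the convergence rate of the resulting EM algorithm will be
superlinear.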
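
To see why this choice makes the mixed derivative vanish, the derivative can be written out explicitly. The following is a sketch under stated assumptions: real scalar signals, a scalar noise level $q$ so that $Q_k(t) = \beta_k(t)\, q$, the $\beta_k(t)$ held fixed while differentiating, and $Q(\theta_1,\theta_2)$ taken as a weighted squared error between $x_k(t;\theta_2) = s_k(t;\theta_2) + \beta_k(t)[y(t) - \sum_l s_l(t;\theta_2)]$ and $s_k(t;\theta_1)$; a prime denotes $\partial/\partial\theta$.

```latex
Q(\theta_1,\theta_2) = c - \sum_{k=1}^{K} \int_{T_i}^{T_f} \frac{1}{\beta_k(t)\, q}
      \big[ x_k(t;\theta_2) - s_k(t;\theta_1) \big]^2 dt

\frac{\partial^2 Q}{\partial\theta_1\, \partial\theta_2}
  = \frac{2}{q} \int_{T_i}^{T_f} \sum_{k=1}^{K} \frac{1}{\beta_k(t)}\, s_k'(t;\theta_1)
      \Big[ s_k'(t;\theta_2) - \beta_k(t) \sum_{l=1}^{K} s_l'(t;\theta_2) \Big] dt
```

The bracket is zero for every $k$ and $t$ at $\theta_1 = \theta_2 = \theta^{(n)}$ when $s_k'(t;\theta^{(n)}) = \beta_k(t) \sum_l s_l'(t;\theta^{(n)})$, which is the choice (4.29). These $\beta_k(t)$ sum to one by construction, but they are admissible only where they are non-negative, which is why the text qualifies the choice with "if allowed".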

Another desired feature of an EM algorithm with varying complete data is that it may
avoid convergence to unwanted stationary points. Following the discussion in section 2.4,
the simplest procedure is to randomly choose $\beta_k(t)$ in each iteration. These randomly chosen
$\beta_k(t)$ have to satisfy the constraints of (4.24), however. A more complicated procedure is
to search for the choice of $\beta_k(t)$ in the domain, defined by the constraints of (4.24), that
will give the largest increase in the likelihood. Since searching the entire domain of possible
$\beta_k(t)$ may be too complicated, we will search only in a sub-domain, which is randomly
chosen, in each iteration.
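
One simple way to draw random weights that satisfy the constraints of (4.24) at every time sample is sketched below. The use of a flat Dirichlet distribution is an assumption made for illustration only (the text does not specify how the random choice is made), and the function name `random_betas` is hypothetical.

```python
import numpy as np

def random_betas(num_components, num_samples, rng):
    """Draw beta_k(t) >= 0 with sum_k beta_k(t) = 1 at every time sample.

    Each Dirichlet draw is a point on the probability simplex, so the
    constraints hold by construction. Returns an array of shape (K, N).
    """
    draws = rng.dirichlet(np.ones(num_components), size=num_samples)
    return draws.T

# Example: weights for K = 3 components over 100 time samples.
betas = random_betas(3, 100, np.random.default_rng(0))
```

A fresh draw in each iteration gives the randomized procedure described above; restricting the draws to a subset of the simplex would correspond to the sub-domain search.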
4.2 Parameter estimation of superimposed signals: The stochastic Gaussian case

In the previous section, the signal components, $s_k(t;\theta_k)$, were deterministic. In this
section, we will present the statistical maximum likelihood problem and its solutions using
the EM algorithm for the case where the signal components, $s_k(t;\theta_k)$, of the observed
composite signal are modeled as sample functions from Gaussian stochastic processes.

This  modeling  is  natural  in  a  variety  of  applications.  Consider,  for  example,  a  passive 
sonar  environment,  where  the  targets  generate  noise-like  acoustic  signals.  The  signals  from 
several  targets  are  superimposed  and  measured  by  our  array  sensors  with  an  additional 
background  noise.  We  may  or  may  not  know  the  spectral  characteristics  of  the  targets’ 
signals.  However,  we  are  usually  interested  in  finding  the  geometrical  parameters,  i.e.  the 
location  or  the  bearing  of  the  targets. 

By assuming that the signal components and the background noise are Gaussian processes,
we get a statistical maximum likelihood problem for estimating the unknown parameters
(which are the geometrical parameters and maybe some spectral parameters of
the signals in the example above). It is difficult to solve this statistical problem directly;
indeed, in many applications, suboptimal procedures were suggested. We, however, will
present in this section procedures, based on the EM algorithm and its extensions, whose
goal is to be optimal, i.e. to solve this maximum likelihood problem.

4.2.1  The  ML  problem 

Consider the model of (4.1) under the following assumptions:

• The signals $s_k(t;\theta_k)$, k = 1, ..., K, are mutually uncorrelated, wide sense stationary
(WSS), zero-mean, vector Gaussian stochastic processes, whose power spectrum
matrices are $S_k(\omega;\theta_k)$, k = 1, 2, ..., K.

• The noise, $n(t)$, is a WSS, zero-mean, vector Gaussian process with a given power
spectrum matrix, $N(\omega)$.

• The observation interval, $T = T_f - T_i$, is long compared with the correlation time
(inverse bandwidth) of the signals and the noise, i.e. $WT/2\pi \gg 1$.

Under the above assumptions, the observed signals, $y(t)$, are also WSS, zero-mean and
Gaussian. WSS processes with a long observation time are conveniently analyzed in the
frequency domain. Fourier transforming $y(t)$ we obtain

$$ Y(\omega_l) = \frac{1}{\sqrt{T}} \int_{T_i}^{T_f} y(t)\, e^{-j\omega_l t}\, dt \qquad (4.30) $$

For $WT/2\pi \gg 1$, the $Y(\omega_l)$'s are asymptotically uncorrelated, zero-mean and Gaussian
with the covariance matrix $P(\omega_l;\theta)$, where $P(\omega;\theta)$ is the spectral density matrix of $y(t)$
given by,

$$ P(\omega;\theta) = \sum_{k=1}^{K} S_k(\omega;\theta_k) + N(\omega) \qquad (4.31) $$

The log-likelihood function observing the $Y(\omega_l)$'s is therefore given by,

$$ \log f_Y(y;\theta) = -\sum_{l} \left[ \log\det \pi P(\omega_l;\theta) + Y^{\dagger}(\omega_l)\, P^{-1}(\omega_l;\theta)\, Y(\omega_l) \right] \qquad (4.32) $$

where the summation in (4.32) is carried over all $\omega_l$ in the signal frequency band. In the
case of discrete observations, the log-likelihood is still given by (4.32), where the $Y(\omega_l)$'s
are the discrete Fourier transform (DFT) of the observed signals.

In either case, to obtain the ML estimate of the various $\theta_k$'s we must solve the following
joint optimization problem

$$ \min_{\theta} \sum_{l} \left[ \log\det P(\omega_l;\theta) + Y^{\dagger}(\omega_l)\, P^{-1}(\omega_l;\theta)\, Y(\omega_l) \right] \qquad (4.33) $$

This is usually a complicated joint optimization problem. Standard search techniques,
such as gradient or Newton-Raphson methods, tend to be complex when applied to this
problem. Thus, we will propose using the EM method to by-pass this complicated
multi-parameter optimization.
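
For reference, the objective of the joint optimization can be evaluated numerically as in the following sketch. It assumes that the Fourier coefficients and the spectral density matrices for a trial parameter value have already been computed and stacked into arrays; the function name `neg_log_likelihood` and the array layout are hypothetical.

```python
import numpy as np

def neg_log_likelihood(Y, P):
    """Sum over frequencies of log det P + Y^H P^{-1} Y, as in (4.33).

    Y : Fourier coefficients Y(w_l), shape (L, M), complex
    P : spectral density matrices P(w_l; theta), shape (L, M, M),
        assumed Hermitian and positive definite
    """
    total = 0.0
    for y_l, p_l in zip(Y, P):
        _, logdet = np.linalg.slogdet(p_l)            # log|det P|, numerically stable
        quad = np.real(np.vdot(y_l, np.linalg.solve(p_l, y_l)))  # Y^H P^{-1} Y
        total += logdet + quad
    return total
```

Each evaluation couples all the $\theta_k$'s through $P$, which is what makes a direct search expensive as K grows.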


4.2.2  Solution  using  the  EM  method 


Following the same considerations as in the deterministic signal case, a natural choice
of complete data, $x(t)$, will be obtained by decomposing the observed signal, $y(t)$, into its
signal components. Thus, repeating equations (4.5) and (4.6), the complete data, $x(t)$, is
given by,

$$ x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_K(t) \end{bmatrix} \qquad (4.34) $$

where

$$ x_k(t) = s_k(t;\theta_k) + n_k(t) \qquad (4.35) $$


Again, the $n_k(t)$ are chosen to be mutually uncorrelated, zero-mean and Gaussian, whose
spectral density matrices are $N_k(\omega) = \beta_k \cdot N(\omega)$, where the $\beta_k$'s are arbitrary real-valued
constants subject to (4.9). Thus the relation between complete and incomplete data is given
again by $y(t) = H \cdot x(t)$ as in (4.8).

The log-likelihood of the complete data $x(t)$ is given by

$$ \log f_X(x;\theta) = -\sum_{l} \left[ \log\det \pi\Lambda(\omega_l;\theta) + X^{\dagger}(\omega_l)\, \Lambda^{-1}(\omega_l;\theta)\, X(\omega_l) \right] $$
$$ \qquad = -\sum_{l} \left[ \log\det \pi\Lambda(\omega_l;\theta) + \mathrm{tr}\left\{ \Lambda^{-1}(\omega_l;\theta)\, X(\omega_l) X^{\dagger}(\omega_l) \right\} \right] \qquad (4.36) $$

where $X(\omega_l)$ is obtained by Fourier transforming $x(t)$; any of its components, $X_k(\omega_l)$,
is given by,

$$ X_k(\omega_l) = \frac{1}{\sqrt{T}} \int_{T_i}^{T_f} x_k(t)\, e^{-j\omega_l t}\, dt \qquad (4.37) $$

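The matrix $\Lambda(\omega_l;\theta)$ is the power spectrum density matrix of the complete data. It is a
block diagonal matrix given by

$$ \Lambda(\omega;\theta) = \begin{bmatrix} \Lambda_1(\omega;\theta_1) & & \\ & \ddots & \\ & & \Lambda_K(\omega;\theta_K) \end{bmatrix} \qquad (4.38) $$

where

$$ \Lambda_k(\omega;\theta_k) = S_k(\omega;\theta_k) + \beta_k N(\omega) \qquad (4.39) $$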


Exploiting the block diagonal form of $\Lambda(\omega;\theta)$, the likelihood of the complete data may
be written as,

$$ \log f_X(x;\theta) = -\sum_{k=1}^{K} \sum_{l} \left[ \log\det \pi\Lambda_k(\omega_l;\theta_k) + \mathrm{tr}\left\{ \Lambda_k^{-1}(\omega_l;\theta_k)\, X_k(\omega_l) X_k^{\dagger}(\omega_l) \right\} \right] \qquad (4.40) $$

From this expression, we notice that, if the complete data were available, maximizing its
likelihood with respect to $\theta$ would be equivalent to minimizing each of the terms in the sum above
with respect to $\theta_k$ separately. This is much simpler than solving a multi-variable optimization
problem with respect to all $\theta_k$'s at once. The sufficient statistics of the complete data
are composed of the quadratic terms $X_k(\omega_l) X_k^{\dagger}(\omega_l)$.

In the suggested EM algorithm, we take advantage of this simple form of the likelihood
of the complete data. The E step will estimate the required quadratic statistics and the M
step will use the estimated statistics in (4.40) and update the parameters via the simple
maximization associated with the likelihood of the complete data.

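The required statistics are the diagonal blocks of the matrix $X(\omega_l) X^{\dagger}(\omega_l)$. The relation
between complete and incomplete data is linear (i.e. $Y(\omega_l) = H \cdot X(\omega_l)$) and the data is
Gaussian. Thus, using the results developed for the linear Gaussian case (see (2.55)), the
conditional expectation of the matrix $X(\omega_l) X^{\dagger}(\omega_l)$, given the observations, $Y(\omega_l)$, and an
assignment, $\theta'$, to the parameters, is given by

$$ \Psi(\omega_l) = E\left\{ X(\omega_l) X^{\dagger}(\omega_l) \mid Y(\omega_l);\ \theta' \right\} $$
$$ \qquad = \Lambda(\omega_l;\theta') - \Gamma(\omega_l;\theta')\, H\, \Lambda(\omega_l;\theta') + \Gamma(\omega_l;\theta')\, Y(\omega_l) Y^{\dagger}(\omega_l)\, \Gamma^{\dagger}(\omega_l;\theta') \qquad (4.41) $$

where $\Gamma(\omega;\theta')$ is the "Kalman gain"

$$ \Gamma(\omega;\theta') = \Lambda(\omega;\theta')\, H^{\dagger} \left[ H \Lambda(\omega;\theta') H^{\dagger} \right]^{-1} $$

Using straightforward algebraic manipulations, the (k,k) block of $\Psi(\omega_l)$ is given by

$$ \Psi_k(\omega_l) = \Lambda_k(\omega_l;\theta_k') - \Lambda_k(\omega_l;\theta_k')\, P^{-1}(\omega_l;\theta')\, \Lambda_k(\omega_l;\theta_k') $$
$$ \qquad + \Lambda_k(\omega_l;\theta_k')\, P^{-1}(\omega_l;\theta')\, Y(\omega_l) Y^{\dagger}(\omega_l)\, P^{-1}(\omega_l;\theta')\, \Lambda_k(\omega_l;\theta_k') \qquad (4.42) $$

where $P(\omega_l;\theta)$ is defined by (4.31).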

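These estimated statistics are used instead of the unavailable statistics of the complete
data. The maximization in the M step will be equivalent to K separate minimizations with
respect to each $\theta_k$. The E and M steps of the EM algorithm for the Gaussian superimposed
signal problem may be summarized as follows:

• The E step: For k = 1, 2, ..., K compute

$$ \Psi_k^{(n)}(\omega_l) = \Lambda_k(\omega_l;\theta_k^{(n)}) - \Lambda_k(\omega_l;\theta_k^{(n)})\, P^{-1}(\omega_l;\theta^{(n)})\, \Lambda_k(\omega_l;\theta_k^{(n)}) $$
$$ \qquad + \Lambda_k(\omega_l;\theta_k^{(n)})\, P^{-1}(\omega_l;\theta^{(n)})\, Y(\omega_l) Y^{\dagger}(\omega_l)\, P^{-1}(\omega_l;\theta^{(n)})\, \Lambda_k(\omega_l;\theta_k^{(n)}) \qquad (4.43) $$

• The M step: For k = 1, 2, ..., K

$$ \theta_k^{(n+1)} = \arg\min_{\theta_k} \sum_{l} \left[ \log\det \Lambda_k(\omega_l;\theta_k) + \mathrm{tr}\left\{ \Lambda_k^{-1}(\omega_l;\theta_k)\, \Psi_k^{(n)}(\omega_l) \right\} \right] \qquad (4.44) $$

We observe that $\theta_k^{(n+1)}$ is the ML estimate of $\theta_k$, where $X_k(\omega_l) X_k^{\dagger}(\omega_l)$ is replaced by its
current estimate, $\Psi_k^{(n)}(\omega_l)$. The algorithm is illustrated in Figure 4.3. The most attractive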
feature  of  the  algorithm  is  that  it  decouples  the  full  multi-dimensional  optimization  of 
equation  (4.33)  into  optimizations  in  smaller  dimensional  parameter  subspaces.  As  in  the 
deterministic  signal  case,  the  complexity  of  the  algorithm  is  essentially  unaffected  by  the 
assumed  number  of  signal  components.  As  K  increases,  we  have  to  increase  the  number 
of  parallel  ML  processors;  however,  each  ML  processor  operates  independently.  Since  the 
algorithm  is  based  on  the  EM  method,  each  iteration  cycle  increases  the  likelihood  until 
convergence  is  accomplished. 
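
The E step for the stochastic case reduces to a few matrix products per frequency, as the following sketch shows. It assumes that the matrices $\Lambda_k(\omega_l;\theta_k^{(n)})$ have already been evaluated and stacked into one array, and that $P$ is obtained as their sum over k (which holds when the weights $\beta_k$ sum to one); the function name `e_step_statistics` and the array layout are hypothetical.

```python
import numpy as np

def e_step_statistics(Y, Lambdas):
    """Estimate the diagonal blocks Psi_k(w_l) in the form of (4.42)-(4.43).

    Y       : observed Fourier coefficients, shape (L, M)
    Lambdas : Lambda_k(w_l; theta_k^(n)), shape (K, L, M, M), Hermitian
    returns : Psi_k(w_l), shape (K, L, M, M)
    """
    P = Lambdas.sum(axis=0)                     # sum_k S_k + N, shape (L, M, M)
    P_inv = np.linalg.inv(P)                    # batched inverse over frequencies
    YY = Y[:, :, None] * Y[:, None, :].conj()   # outer products Y Y^H
    G = Lambdas @ P_inv                         # Lambda_k P^{-1}, broadcast over k
    G_h = np.conj(np.swapaxes(G, -1, -2))       # (Lambda_k P^{-1})^H = P^{-1} Lambda_k
    return Lambdas - G @ Lambdas + G @ YY @ G_h
```

A useful check on an implementation is that when the observed outer product $Y Y^{\dagger}$ happens to equal $P$, the last two terms cancel and the estimate reduces to $\Lambda_k$ itself.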

As in the deterministic case, this EM algorithm corresponds to a family of complete
data definitions. A specific member of the family is associated with a specific choice of
the $\beta_k$'s. This family of complete data definitions can be extended by allowing a different
choice of $\beta_k$'s at each frequency. The EM algorithm for any member of this extended family
will keep the structure of the algorithm steps (4.43) and (4.44), where $\beta_k(\omega_l)$ is used in the
definition of $\Lambda_k(\cdot)$, (4.39).

Figure 4.3: The EM algorithm for stochastic Gaussian signals

We may choose a fixed complete data definition throughout the algorithm iterations, or
we may vary the complete data definition from iteration to iteration. Varying the complete
data within this family of complete data definitions will correspond to choosing different
$\beta_k(\omega_l)$'s in each iteration, but otherwise the algorithm steps remain the same. Possible
strategies for choosing the complete data and varying it from iteration to iteration were
discussed in Chapter 2 and previously in this chapter, for the deterministic signal case.
These discussions are relevant in this case too.

4.3  Application  to  multipath  time-delay  estimation 

Time delay estimation is a common problem in underwater acoustics as well as in radar.
Geometrical parameters (such as range and location of targets) and physical parameters
(such as velocity and temperature profiles of the ocean) are typically found via time delay
analysis. Consider, for example, the ocean tomography experiment, described in [30] and
[51]. An acoustic transducer, located at a given point in the ocean, transmits a signal which
is received time-delayed by sensors at known locations. The estimated delay times between the
transmitted signal and the received signals are used as an input to an inverse problem that
finds the ocean profile in the experiment area.

Multipath  may  occur  due  to  reflections  and  propagation  modes.  The  received  signal  in 
this  case  contains  several  echoes  of  the  transmitted  signal  having  different  time-delays  and 
attenuations,  i.e.  it  may  be  written  as 

$$ y(t) = \sum_{k=1}^{K} \alpha_k\, s(t - \tau_k) + n(t) \qquad (4.45) $$

The  existence  of  more  than  one  path  is  undesired  in  some  cases;  in  the  ocean  tomography 
experiment,  the  additional  echoes  interfere  with  and  corrupt  the  interesting  direct  path 
signal.  However,  in  other  cases,  additional  important  information  may  be  obtained  from 
finding  the  time  delay  of  the  other  paths.  A  single  sensor  may  determine  the  range  and  the 
depth  of  a  target,  if  we  can  find  the  delay  times  of  the  direct  path  and  of  the  paths  that 
result  from  a  single  bottom  or  surface  reflection. 

We will be interested in this section in estimating the delay times of the multipath signal
(4.45). In a variety of applications, we may model the components of the multipath signal as
deterministic or stochastic. In applications such as ocean tomography, active sonar/radar
and many more, a deterministic known waveform signal (pulse) is transmitted. In a passive
determination of range and depth of a target by a single sensor, the target may generate a
noise-like acoustic signal, naturally modeled as a sample signal from a stochastic Gaussian
process. In both cases, we will be able to apply the results of the previous sections to obtain
an estimate of the delay times via the EM algorithm.

We will present, in this section, the detailed algorithm and experimental results for the
deterministic signal case. In [52], we have presented the EM algorithm for multipath time
delay estimation in the Gaussian signal case.


4.3.1  The  deterministic  case 


Suppose that the observed signal is given by (4.45), where $s(t)$ is a known signal waveform,
the noise $n(t)$ is Gaussian with a flat spectrum over the receiver frequency band,
and the observation time is $T_i \le t \le T_f$. The problem is to estimate the pairs $(\alpha_k, \tau_k)$ for
k = 1, 2, ..., K.

In this case, the direct ML approach given by (4.3) reduces to,

$$ \min_{\alpha_1, \cdots, \alpha_K;\ \tau_1, \cdots, \tau_K} \int_{T_i}^{T_f} \left| y(t) - \sum_{k=1}^{K} \alpha_k\, s(t - \tau_k) \right|^2 dt \qquad (4.46) $$


This optimization problem is addressed in [53], where it is shown that the optimal $\alpha_k$'s
may be expressed explicitly in terms of the optimal $\tau_k$'s. Thus, the 2K-dimensional search
can be reduced into a K-dimensional search. However, as pointed out in [53], for K > 3 the
required computations become too intensive. Consequently, ad-hoc approaches and
sub-optimal solutions have been proposed. The most common solution consists of correlating
$y(t)$ with a replica of $s(t)$ and searching for the K highest peaks of the correlation function.
If the various paths are resolvable, i.e. the difference between $\tau_k$ and $\tau_l$ is long compared with
the temporal correlation of the signal for all combinations of k and l, this approach yields
near optimal estimates. However, in situations where the signal paths are unresolvable, this
approach is distinctly sub-optimal.

We identify the model in (4.45) as a special case of (4.1). Therefore, in correspondence
with equations (4.19) and (4.20), we obtain the following algorithm:

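• Start, n = 0, initialize $\alpha_k^{(0)}$ and $\tau_k^{(0)}$, k = 1, ..., K.

• Iterate (until some convergence criterion is met)

- The E step: For k = 1, 2, ..., K compute

$$ x_k^{(n)}(t) = \alpha_k^{(n)} s(t - \tau_k^{(n)}) + \beta_k \left[ y(t) - \sum_{l=1}^{K} \alpha_l^{(n)} s(t - \tau_l^{(n)}) \right] \qquad (4.47) $$

where the $\beta_k$'s are any real-valued positive scalars satisfying

$$ \sum_{k=1}^{K} \beta_k = 1 $$

- The M step: For k = 1, 2, ..., K

$$ \left( \alpha_k^{(n+1)}, \tau_k^{(n+1)} \right) = \arg\min_{\alpha, \tau} \int_{T_i}^{T_f} \left| x_k^{(n)}(t) - \alpha\, s(t - \tau) \right|^2 dt \qquad (4.48) $$

- n = n + 1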

Assuming that the observation interval, T, is long compared to the duration of the signal
and to the maximum expected delay, the two-parameter minimization required in (4.48)
may be simplified, and can be carried out explicitly as follows:

$$ \tau_k^{(n+1)} = \arg\max_{\tau} \left| g_k^{(n)}(\tau) \right| \qquad (4.49) $$

$$ \alpha_k^{(n+1)} = \frac{ g_k^{(n)}(\tau_k^{(n+1)}) }{ E } \qquad (4.50) $$

where $E = \int_{T_i}^{T_f} |s(t)|^2 dt$ is the signal energy, and

$$ g_k^{(n)}(\tau) = \int_{T_i}^{T_f} x_k^{(n)}(t)\, s^{*}(t - \tau)\, dt \qquad (4.51) $$

Figure 4.4: The EM algorithm for multipath time delay estimation

Note that $g_k^{(n)}(\tau)$ can be generated by passing $x_k^{(n)}(t)$ through a filter matched to $s(t)$. The
algorithm is illustrated in Figure 4.4. This computationally attractive algorithm iteratively
decreases the objective function in (4.46) without ever going through the indicated
multi-parameter optimization. The complexity of the algorithm is essentially unaffected by the
assumed number of signal paths. As K increases we increase the number of matched filters
in parallel; however, each matched filter output is maximized separately.
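
The complete iteration can be sketched for sampled, real-valued data as follows. This is a minimal sketch, not the implementation used for the results reported in this chapter: it assumes integer delays restricted to a grid on which the delayed pulse lies entirely inside the observation window, a trapezoidal test pulse, equal weights $\beta_k = 1/K$, and an arbitrary noise level and starting point. The names `trapezoid`, `objective` and `em_multipath` are hypothetical.

```python
import numpy as np

def trapezoid(t):
    """Trapezoidal pulse: rises on [0, 5], flat on [5, 15], falls on [15, 20]."""
    t = np.asarray(t, dtype=float)
    return np.clip(np.minimum(t / 5.0, (20.0 - t) / 5.0), 0.0, 1.0)

def objective(y, S, alpha, idx):
    """Squared residual of (4.46) for amplitudes alpha and delay indices idx."""
    return float(np.sum((y - alpha @ S[idx]) ** 2))

def em_multipath(y, S, alpha, idx, betas, n_iter=30):
    """EM iterations in the form of (4.47)-(4.50) on a grid of candidate delays.

    S holds one delayed replica s(t - tau) per row; idx selects current delays.
    """
    energy = np.sum(S ** 2, axis=1)               # pulse energy for each delay
    rows = np.arange(len(alpha))
    history = [objective(y, S, alpha, idx)]
    for _ in range(n_iter):
        paths = alpha[:, None] * S[idx]           # current estimate of each path
        residual = y - paths.sum(axis=0)
        x = paths + betas[:, None] * residual     # E step, as in (4.47)
        g = x @ S.T                               # matched-filter outputs g_k(tau)
        idx = np.argmax(g ** 2 / energy, axis=1)  # M step, delay, as in (4.49)
        alpha = g[rows, idx] / energy[idx]        # M step, amplitude, as in (4.50)
        history.append(objective(y, S, alpha, idx))
    return alpha, idx, history

# Hypothetical test signal: three unit-amplitude paths at delays 0, 5 and 10.
rng = np.random.default_rng(0)
t = np.arange(-40, 60)
delays = np.arange(-20, 31)
S = np.stack([trapezoid(t - tau) for tau in delays])
y = S[[20, 25, 30]].sum(axis=0) + 0.05 * rng.standard_normal(t.size)
alpha, idx, history = em_multipath(
    y, S, np.ones(3), np.array([5, 25, 45]), np.full(3, 1.0 / 3.0))
```

Because each M step is an exact least-squares fit over the delay grid, the recorded values in `history` should never increase from one iteration to the next, in line with the monotonicity property noted above. Convergence to the global minimum is not guaranteed by this sketch and depends on the starting point.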

We note that the algorithm can be extended to the case where the signal waveform, $s(t)$,
is unknown. The general EM algorithm steps for the case where the signal waveforms are
unknown are given by equations (4.21) and (4.22). For our problem, the E step is similar
to (4.47), where we use the current estimated waveform, $s^{(n)}(t)$, instead of the a-priori given
$s(t)$. The M step requires a more complicated maximization with respect to the unknown
signal waveform values and the unknown parameters. Using an alternate maximization
procedure for the M step, the M step of (4.22) will reduce to,

$$ \left( \alpha_k^{(n+1)}, \tau_k^{(n+1)} \right) = \arg\min_{\alpha, \tau} \int_{T_i}^{T_f} \left| x_k^{(n)}(t) - \alpha\, s^{(n)}(t - \tau) \right|^2 dt, \qquad k = 1, \cdots, K \qquad (4.52) $$

,'"*»(,)  =  1  (4.53) 

£(«! , 

We have discussed the unknown signal waveform case in [54], following the considerations
above.

For the case in which the number of signals, K, is unknown, several criteria for its
determination have been developed in [55] and elsewhere. Usually, these criteria are composed
of the likelihood function above and an additional penalty term. Thus, as discussed in
section 2.5, these criteria can be incorporated into an EM algorithm, similar to the algorithm
above.


4.3.2  Simulation  results 

To demonstrate the performance of the algorithm, we have considered the following
example: The observed signal, $y(t)$, consists of three signal paths in additive noise,

$$ y(t) = \sum_{k=1}^{3} \alpha_k\, s(t - \tau_k) + n(t) $$

where $s(t)$ is a trapezoidal pulse

$$ s(t) = \begin{cases} t/5 & 0 < t < 5 \\ 1 & 5 < t < 15 \\ 4 - t/5 & 15 < t < 20 \end{cases} $$

The observed data consists of 100 time samples, indexed -40 < t < 60. The additive noise
is spectrally flat with a spectral level of 0.025, so that the post-integration signal to
noise ratio (SNR) is approximately 16 dB.


Figure  4.5:  The  observed  data 

The actual delays are

$$ \tau_1 = 0, \qquad \tau_2 = 5, \qquad \tau_3 = 10 $$

and the amplitude scales are

$$ \alpha_k = 1, \qquad k = 1, 2, 3. $$

In  Figure  4.5,  we  have  plotted  the  observed  data.  In  Figure  4.6,  we  have  plotted  the 
matched  filter  output  as  a  function  of  the  delay.  As  we  can  see,  the  conventional  method 
cannot  resolve  the  various  signal  paths  and  estimate  their  parameters. 
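
The loss of resolution can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes integer sample times, a unit-height trapezoidal pulse with breakpoints at 0, 5, 15 and 20, and a noise-free three-path signal; the names `trapezoid` and `matched_filter` are hypothetical.

```python
import numpy as np

def trapezoid(t):
    """Trapezoidal pulse: rises on [0, 5], flat on [5, 15], falls on [15, 20]."""
    t = np.asarray(t, dtype=float)
    return np.clip(np.minimum(t / 5.0, (20.0 - t) / 5.0), 0.0, 1.0)

def matched_filter(y, t, delays):
    """Correlate y with the replica s(t - tau) for every candidate delay tau."""
    return np.array([np.sum(y * trapezoid(t - tau)) for tau in delays])

t = np.arange(-40, 60)
delays = np.arange(-20, 31)
# Noise-free signal with three unit-amplitude paths at delays 0, 5 and 10.
y_clean = sum(trapezoid(t - tau) for tau in (0, 5, 10))
g = matched_filter(y_clean, t, delays)
peak_delay = int(delays[np.argmax(g)])
```

Even without noise, the correlation has a single broad maximum at the middle delay (`peak_delay` is 5 here) rather than three separate peaks at 0, 5 and 10, because the path separation is much shorter than the pulse duration.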

Figure 4.6: The conventional matched filter response

First, as a reference, we computed the ML estimates by a direct minimization of the
objective function (4.46), using exhaustive search. We obtained,

$$ \hat{\tau}_1 = 0.0117 \qquad \hat{\tau}_2 = 5.0031 \qquad \hat{\tau}_3 = 9.9884 $$

$$ \hat{\alpha}_1 = 1.1511 \qquad \hat{\alpha}_2 = 0.7799 \qquad \hat{\alpha}_3 = 0.9471 $$

The value of the objective function at the minimum (corresponding to the value of the
log-likelihood function at the maximum) is,

$$ J = 0.45879 $$

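We also computed lower bounds on the root mean square (r.m.s) error of each parameter,
using the Cramer-Rao inequality. We obtained,

$$ \sigma(\hat{\tau}_1) = 0.028 \qquad \sigma(\hat{\tau}_2) = 0.030 \qquad \sigma(\hat{\tau}_3) = 0.028 $$

$$ \sigma(\hat{\alpha}_1) = 0.076 \qquad \sigma(\hat{\alpha}_2) = 0.079 \qquad \sigma(\hat{\alpha}_3) = 0.07908 $$

where $\sigma(\hat{\tau}_k)$ denotes the attainable r.m.s error in the estimate of $\tau_k$, and $\sigma(\hat{\alpha}_k)$ denotes the
attainable r.m.s error in the estimate of $\alpha_k$.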

We have applied our algorithm. In Figure 4.7, we have plotted the matched filter
response to the various signal paths, as they evolve during the iterations. In addition
to this experiment, we have tried this algorithm using several arbitrarily selected starting
points; the algorithm has converged, within the Cramer-Rao lower bound, to the ML estimate
of all the unknown parameters, after 10 to 15 iterations, regardless of the initial guess.

Figure 4.7: The matched filter response to each signal path

Using the asymptotic efficiency and lack of bias of the ML estimates, we can claim
with some confidence that the r.m.s error performance of the algorithm is the minimum
attainable, characterized by the Cramer-Rao lower bound.

Additional simulation results, which present additional examples of the deterministic
multipath time delay estimation, may be found in [52].


4.4  Application  to  multiple  source  Direction  Of  Arrival  (DOA) 
estimation 

Passive direction of arrival (DOA) estimation using an array of sensors is a common
problem  in  underwater  acoustics,  radar  and  geophysical  seismic  environments.  Using  an 
array  of  M  spatially  distributed  sensors,  the  bearing  of  a  source,  radiating  toward  the 
array,  can  be  determined  by  estimating  the  phase  differences  or  the  time  delays  among  the 
signals  received  in  the  array  sensors. 

The  standard  technique  for  DOA  analysis  is  known  as  beamforming.  For  any  given 
direction,  the  array  signals  are  delayed  and  added  accordingly,  and  an  output  signal  is 
generated.  The  energy  of  the  output  signal  is  recorded  as  a  function  of  the  direction,  and 
the DOA estimates correspond to "peaks" of this function. This is an intuitively appealing
approach, and indeed, when only a single source exists, the maximum likelihood method, in
a variety of modeling assumptions, reduces to maximizing the beamformer output. When
several sources exist, this approach is nearly optimal, if the various signal sources are widely
separated. However, if the sources are closely spaced this approach is distinctly suboptimal.
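
A delay-and-sum beamformer can be sketched as follows. The sketch makes a narrowband assumption that is not part of the description above: delays are applied as phase shifts at a single wavelength, and the sensors lie on a line at known positions. The names `steering_vector` and `beamformer_power` are hypothetical.

```python
import numpy as np

def steering_vector(theta, positions, wavelength):
    """Phase delays across the array for a plane wave arriving at angle
    theta (radians, measured from boresight). Narrowband assumption."""
    return np.exp(-2j * np.pi * positions * np.sin(theta) / wavelength)

def beamformer_power(snapshots, positions, wavelength, angles):
    """Average output energy of the delay-and-sum beamformer per look angle.

    snapshots : complex sensor samples, shape (M, T)
    angles    : candidate directions in radians
    """
    powers = []
    for theta in angles:
        a = steering_vector(theta, positions, wavelength)
        output = a.conj() @ snapshots / len(positions)  # align phases, then add
        powers.append(np.mean(np.abs(output) ** 2))
    return np.array(powers)
```

For a single noise-free source the output energy is largest at the true direction; the DOA estimates are then read off as the peaks of `beamformer_power` over the scanned angles.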

The radiating sources generate signals that may be modeled as deterministic or stochastic.
In some radar environments, the targets transmit known waveform pulses which are
received by the antenna array of the receiver. Similarly, a seismic pulse received by a sensor
array is naturally modeled as a deterministic signal. On the other hand, the acoustic signal,
generated by a target and received by a passive sonar array, is a noise-like signal, which is
typically modeled as a WSS Gaussian stochastic process.

We will concentrate in this section on the stochastic signal case. We note, however, that
methods for multiple source DOA estimation of deterministic signals via the EM algorithm
were presented in [56], [57]. We will start by presenting the general mathematical model
of the multiple source DOA estimation problem. We will then assume that the signals are
Gaussian processes and present the resulting statistical maximum likelihood problem. The
solution of this ML problem, using the EM algorithm, will then be presented in detail, and
we will describe the simulation results of a specific example.

4.4.1  The  passive  multiple  source  DOA  estimation  problem 

We will assume that K spatially distributed sources are radiating signals towards an
array of M spatially distributed sensors. Assuming perfect propagation conditions in the
medium and ignoring amplitude attenuations of the signal wavefront across the array, the
actual waveform observed at the m-th sensor output is

$$ y_m(t) = \sum_{k=1}^{K} s_k(t - \tau_{km}) + n_m(t), \qquad m = 1, 2, \cdots, M \qquad (4.54) $$

where $s_k(t)$ is the k-th source signal, $n_m(t)$ is the additive noise at the m-th sensor output,
and $\tau_{km}$ is the travel time of the signal wavefront from the k-th source to the m-th sensor.

Information concerning the various source location parameters can be extracted by measuring the various $\tau_{km}$. In the passive case, one can only measure the travel time differences. Let the $M$th sensor be the reference sensor, and set $\tau_{kM} = 0$; then $\tau_{km}$ measures the travel time difference of the $k$th signal wavefront between the $(m, M)$ sensor pair.

We assume that the various signal sources are relatively far-field, so that the observed signal wavefronts are essentially planar across the array. In this case, the unknown source location parameters are their bearings or directions of arrival. To simplify the exposition, we suppose further that the array sensors are co-linear. Then,

$$ \tau_{km} = \frac{d_m}{c} \sin\theta_k \tag{4.55} $$

where $d_m$ is the spacing between the sensor $m$ and the reference sensor $M$, $c$ is the velocity of propagation in the medium, and $\theta_k$ is the angle of arrival of the $k$th signal wavefront relative to the boresight.
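As a small numerical illustration of the far-field delay relation (4.55), the following sketch computes the travel time differences across a co-linear array. The sensor spacings, the bearing and the propagation velocity of 1500 m/s are hypothetical values chosen only for the example; they do not come from the text.

```python
import math

def travel_time_differences(spacings_m, bearing_deg, c=1500.0):
    """Return tau_km = (d_m / c) * sin(theta_k) for every sensor m.

    spacings_m  : distances d_m from each sensor to the reference sensor (assumed, in meters)
    bearing_deg : angle of arrival theta_k relative to the boresight, in degrees
    c           : propagation velocity (assumed value, in m/s)
    """
    theta = math.radians(bearing_deg)
    return [d / c * math.sin(theta) for d in spacings_m]

# Hypothetical five-sensor array; the last sensor is the reference (d_M = 0).
spacings = [4.0, 3.0, 2.0, 1.0, 0.0]
delays = travel_time_differences(spacings, 30.0)
```

The reference sensor always gets a zero delay, in agreement with the convention $\tau_{kM} = 0$ used above.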

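Substituting (4.55) into (4.54) and concatenating the various equations, we obtain

$$ \mathbf{y}(t) = \sum_{k=1}^{K} \mathbf{s}_k(t;\theta_k) + \mathbf{n}(t) \tag{4.56} $$

where

$$ \mathbf{s}_k(t;\theta_k) = \begin{bmatrix} s_k(t - \gamma_1 \sin\theta_k) \\ s_k(t - \gamma_2 \sin\theta_k) \\ \vdots \\ s_k(t - \gamma_{M-1} \sin\theta_k) \\ s_k(t) \end{bmatrix} \tag{4.57} $$

and $\gamma_m = d_m / c$. We note that this is a special case of the superimposed signal problem of (4.1).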

A statistical ML problem for estimating the unknown directions of arrival is achieved by a further statistical modeling of the various signals in (4.56). We will now present the ML problem and its solution using the EM algorithm, where we model the signals, $\{s_k(t)\}$, and the noise as Gaussian processes.


4.4.2  The  Gaussian  case 


Suppose that the various $s_k(t)$ and the various $n_m(t)$ are mutually independent, WSS, zero-mean Gaussian processes with spectral densities $S_k(\omega)$ and $N_m(\omega)$ respectively. We will also assume that the observation time, $T = T_f - T_i$, is long compared with the correlation time (inverse bandwidth) of the signals and the noises. Under these assumptions we may write the likelihood of the observations, $\mathbf{y}(t)$, in the frequency domain; the ML estimates of the unknown $\theta_k$ will be achieved by (see Eq. (4.33)),

and $N(\omega)$ is the diagonal matrix

$$ N(\omega) = \operatorname{diag}\big( N_1(\omega), N_2(\omega), \ldots, N_M(\omega) \big) \tag{4.61} $$

Again,  as  in  the  previous  examples,  the  resulting  ML  problem  requires  the  solution  of 
a  complicated  multi-parameter  optimization  problem  in  several  unknowns.  Consequently, 
numerous  ad-hoc  solutions  and  sub-optimal  approaches  have  been  proposed  in  the  recent 
literature, e.g. [58, 59, 60, 61, 62], to by-pass this ML problem. Still, the most common
approach  consists  of  beamforming  and  searching  for  the  K  highest  peaks.  As  noted  above, 
if  the  various  sources  are  widely  separated,  this  approach  is  nearly  optimal.  However,  when 
the  sources  are  closely  spaced  we  are  likely  to  obtain  poor  estimates  of  the  various  DOA’s. 

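Identifying the model in (4.56) as a special case of the superimposed signal in noise case, the algorithm specified by (4.43) and (4.44) is directly applicable, where

$$ \Lambda_k(\omega;\theta_k) = S_k(\omega) V(\omega;\theta_k) V^{\dagger}(\omega;\theta_k) + \beta_k N(\omega) \tag{4.62} $$

This special form of the matrix $\Lambda_k$ allows the following simplifications. We may write

$$ \det \Lambda_k(\omega;\theta_k) = \left[ 1 + \frac{S_k(\omega)}{\beta_k} V^{\dagger}(\omega;\theta_k) N^{-1}(\omega) V(\omega;\theta_k) \right] \cdot \beta_k^{M} \det N(\omega) \tag{4.63} $$

$$ \Lambda_k^{-1}(\omega;\theta_k) = \frac{1}{\beta_k} \left[ N^{-1}(\omega) - \frac{S_k(\omega) N^{-1}(\omega) V(\omega;\theta_k) V^{\dagger}(\omega;\theta_k) N^{-1}(\omega)}{\beta_k + S_k(\omega) V^{\dagger}(\omega;\theta_k) N^{-1}(\omega) V(\omega;\theta_k)} \right] \tag{4.64} $$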
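Both simplifications are consequences of the rank-one structure of (4.62). As a sketch of the reasoning, using the shorthand $v = V(\omega;\theta_k)$, $N = N(\omega)$, $S = S_k(\omega)$ and $\beta = \beta_k$ (abbreviations introduced here only for this derivation), the matrix determinant lemma and the Sherman-Morrison identity give:

```latex
% Shorthand (introduced for this sketch only):
%   v = V(\omega;\theta_k),  N = N(\omega),  S = S_k(\omega),  \beta = \beta_k
\begin{align*}
\det\left(\beta N + S\, v v^{\dagger}\right)
  &= \det(\beta N)\left(1 + \frac{S}{\beta}\, v^{\dagger} N^{-1} v\right)
   = \beta^{M} \det N \left(1 + \frac{S}{\beta}\, v^{\dagger} N^{-1} v\right), \\
\left(\beta N + S\, v v^{\dagger}\right)^{-1}
  &= \frac{1}{\beta} N^{-1}
     - \frac{\frac{S}{\beta^{2}}\, N^{-1} v v^{\dagger} N^{-1}}
            {1 + \frac{S}{\beta}\, v^{\dagger} N^{-1} v}
   = \frac{1}{\beta}\left[ N^{-1}
     - \frac{S\, N^{-1} v v^{\dagger} N^{-1}}{\beta + S\, v^{\dagger} N^{-1} v} \right].
\end{align*}
```

Because $N$ is diagonal, $N^{-1}$ is available in closed form, so neither expression requires a general matrix inversion.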

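Substituting (4.63) and (4.64) into (4.44) and ignoring the terms that are independent of $\theta_k$, the M step of the algorithm will be simplified. The resulting EM algorithm is:

• Start, n = 0, initialize $\theta_k^{(0)}$, k = 1, ..., K.

• Iterate (until some convergence criterion is met)

- The E step: For k = 1, 2, ..., K compute

$$ \begin{aligned} \Phi_k^{(n)}(\omega_\ell) = {} & \Lambda_k(\omega_\ell;\theta_k^{(n)}) - \Lambda_k(\omega_\ell;\theta_k^{(n)}) P^{-1}(\omega_\ell;\theta^{(n)}) \Lambda_k(\omega_\ell;\theta_k^{(n)}) \\ & + \Lambda_k(\omega_\ell;\theta_k^{(n)}) P^{-1}(\omega_\ell;\theta^{(n)}) Y(\omega_\ell) Y^{\dagger}(\omega_\ell) P^{-1}(\omega_\ell;\theta^{(n)}) \Lambda_k(\omega_\ell;\theta_k^{(n)}) \end{aligned} \tag{4.65} $$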

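- The M step: For k = 1, 2, ..., K

$$ \theta_k^{(n+1)} = \arg\max_{\theta} \sum_{\ell} F_k(\omega_\ell) \cdot V^{\dagger}(\omega_\ell;\theta) N^{-1}(\omega_\ell) \cdot \Phi_k^{(n)}(\omega_\ell) \cdot N^{-1}(\omega_\ell) V(\omega_\ell;\theta) \tag{4.66} $$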


where $F_k(\omega)$ is a shaping filter, given by,

$$ F_k(\omega) = \frac{S_k(\omega)}{\beta_k + S_k(\omega) \sum_{m=1}^{M} 1/N_m(\omega)} \tag{4.67} $$


- n = n + 1


We note that the objective function in (4.66) is the array beamformer, where the product term for the $k$th source is substituted by its current estimate, $\Phi_k^{(n)}(\omega_\ell)$. The algorithm is illustrated in Figure 4.8. This computationally attractive algorithm iteratively decreases the objective function in (4.58) without ever going through the indicated multi-parameter optimization. Again, the complexity of the algorithm is essentially unaffected by the assumed number of signal sources. As $K$ increases, we have to increase the number of beamformers in parallel; however, each beamformer output is maximized separately.


Figure 4.8: The EM algorithm for multiple source DOA estimation

4.4.3 Simulation results

To demonstrate the performance of the algorithm, we have considered the following example. Our array of sensors consists of five co-linear, evenly spaced sensors. There are two far-field signal sources at the bearings

$$ \theta_1 = 0^{\circ}, \qquad \theta_2 = 10^{\circ} $$

relative to the boresight. The array-source geometry is shown in Figure 4.9. The signals and the noises are spectrally flat with $S_k(\omega) = S$ and $N_m(\omega) = N$ over the frequency band $[-W/2, W/2]$. We assume that $S/N = 1$, and that $WT/2\pi = 20$ (so that the post integration SNR per channel is approximately 23 dB). The array length is taken to be $L = 6\lambda$, where $\lambda$ is the wavelength associated with the highest frequency in the signal band.

In  Figure  4.10,  we  have  plotted  the  array  beamformer  response  as  a  function  of  the 
bearing  angle.  As  we  may  see,  the  conventional  beamformer  cannot  resolve  the  signal 
sources  and  thus  cannot  estimate  their  bearings. 

Figure 4.9: Array-Source geometry

Figure 4.10: The conventional Beamformer

The ML estimates, obtained by direct minimization using exhaustive search of the objective function in (4.58), are

$$ \hat{\theta}_1 = -0.0563, \qquad \hat{\theta}_2 = 10.4556 $$

The value of the objective function at the minimum (corresponding to the value of the log-likelihood function at the maximum) is

$$ J = 159.0137 $$

We have also computed the Cramer-Rao bound on the r.m.s. error of each parameter estimate. We obtain

$$ \sigma(\hat{\theta}_1) = 0.2680, \qquad \sigma(\hat{\theta}_2) = 0.2722 $$

We  now  apply  our  algorithm.  In  Figure  4.11,  we  have  plotted  the  beamformer  response 
to  the  various  signal  sources  as  they  evolve  with  iterations.  In  Figure  4.12,  we  have  tabulated 
the  results  using  several  arbitrarily  selected  initial  guesses.  We  see  that  in  all  cases,  after  5 
to  10  iterations,  the  algorithm  essentially  converges,  within  the  Cramer-Rao  lower  bound,  to 
the  ML  estimates  of  all  unknown  bearing  parameters  simultaneously;  therefore  the  various 
signal  sources  are  correctly  resolved. 

Figure 4.11: The Beamformer response to each signal source

Figure 4.12: Tables of results for multiple source DOA estimation

4.5 Sequential and adaptive algorithms

Sequential and adaptive algorithms for estimating the parameters of superimposed signals in noise, based on the EM algorithm, may be suggested following the considerations of Chapter 3. As discussed in Chapter 3, in general, any given iterative batch EM algorithm may be transformed into a sequential algorithm using the stochastic approximation idea. The EM algorithms suggested in this chapter for both the deterministic case and the stochastic case have a structure that may support recursive E and M steps. However, we will concentrate in this section on the deterministic superimposed signals problem, for which we were able to develop sequential EM algorithms based on exact EM mapping. These algorithms will be presented explicitly, and their application to solving the problem of multiple sinusoids in noise will be described.


4.5.1 Sequential algorithms based on exact EM mapping for the Deterministic case


Sequential  EM  algorithms,  based  on  exact  EM  mapping,  are  achieved  by  examining  the 
expression  of  the  EM  iteration,  which  depends,  in  general,  on  the  entire  past  observations, 
and  recognizing  the  terms  that  depend  on  the  current  data  and  the  terms  that  depend  only 
on  past  data.  Hopefully,  the  terms  that  depend  on  past  data  may  be  summarized  into  a 
compact  form,  that  will  be  subsequently  updated  and  recorded.  Based  on  these  recorded 
quantities  and  the  new  measurements,  the  parameters  will  be  updated  using  an  exact  EM 
iteration. 

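Thus, let us consider the EM iteration for the deterministic signal case, given by equations (4.19) and (4.20). We will assume that the signals are discrete so that the integral in (4.20) is replaced by a sum. Assume that we observe $\mathbf{y}(1), \ldots, \mathbf{y}(n)$, i.e. the observation index is $t = 1, \ldots, n$. The E and M steps of this EM algorithm are given by,

• The E step: For k = 1, 2, ..., K and t = 1, ..., n compute

$$ \hat{\mathbf{x}}_k^{(n)}(t) = \mathbf{s}_k(t;\boldsymbol{\theta}_k^{(n)}) + \beta_k \Big[ \mathbf{y}(t) - \sum_{l=1}^{K} \mathbf{s}_l(t;\boldsymbol{\theta}_l^{(n)}) \Big] \tag{4.68} $$

• The M step: For k = 1, 2, ..., K

$$ \boldsymbol{\theta}_k^{(n+1)} = \arg\min_{\boldsymbol{\theta}_k} \sum_{t=1}^{n} \big[ \hat{\mathbf{x}}_k^{(n)}(t) - \mathbf{s}_k(t;\boldsymbol{\theta}_k) \big]^{\dagger} Q_k^{-1} \big[ \hat{\mathbf{x}}_k^{(n)}(t) - \mathbf{s}_k(t;\boldsymbol{\theta}_k) \big] \tag{4.69} $$

We will assume that $Q_k = \gamma_k I$. Combining the E and M steps, and ignoring the terms that are independent of $\boldsymbol{\theta}_k$, we may write the EM iteration as follows. For k = 1, 2, ..., K

$$ \begin{aligned} \boldsymbol{\theta}_k^{(n+1)} = \arg\min_{\boldsymbol{\theta}_k} \Big\{ & \sum_{t=1}^{n} \big| \mathbf{s}_k(t;\boldsymbol{\theta}_k) \big|^2 + \beta_k \sum_{l=1}^{K} 2\,\mathrm{Re} \sum_{t=1}^{n} \mathbf{s}_l^{\dagger}(t;\boldsymbol{\theta}_l^{(n)}) \mathbf{s}_k(t;\boldsymbol{\theta}_k) \\ & - 2\,\mathrm{Re} \sum_{t=1}^{n} \mathbf{s}_k^{\dagger}(t;\boldsymbol{\theta}_k^{(n)}) \mathbf{s}_k(t;\boldsymbol{\theta}_k) - \beta_k \cdot 2\,\mathrm{Re} \sum_{t=1}^{n} \mathbf{y}^{\dagger}(t) \mathbf{s}_k(t;\boldsymbol{\theta}_k) \Big\} \end{aligned} \tag{4.70} $$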
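The E step of this iteration can be illustrated with a few lines of code. The sketch below is a minimal version for scalar (single-channel) signals, which is a simplifying assumption made here; the function name `e_step` and the toy data are hypothetical. Each component estimate is the current signal estimate plus a fraction $\beta_k$ of the common residual.

```python
def e_step(y, signals, betas):
    """Decompose y into per-component estimates x_k(t).

    y       : observed samples y(t)
    signals : list of K lists, the current signal estimates s_k(t; theta_k^(n))
    betas   : list of K weights beta_k (assumed here to sum to one)
    Returns a list of K lists with x_k(t) = s_k(t) + beta_k * (y(t) - sum_l s_l(t)).
    """
    residual = [y_t - sum(s[t] for s in signals) for t, y_t in enumerate(y)]
    return [[s[t] + beta * r for t, r in enumerate(residual)]
            for s, beta in zip(signals, betas)]

# Toy data (hypothetical): two components, three samples.
y = [1.0 + 2.0j, 0.5 - 1.0j, -0.3 + 0.2j]
signals = [[1.0, 0.0, 0.0], [0.0, 1.0j, 0.0]]
betas = [0.5, 0.5]
x_hat = e_step(y, signals, betas)
recombined = [sum(x[t] for x in x_hat) for t in range(len(y))]
```

When the $\beta_k$ sum to one, the component estimates add up to the observation again, which is a convenient sanity check on an implementation.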

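We notice that the term that depends on the observations, $\mathbf{y}(t)$, is the cross-correlation between $\mathbf{y}(t)$ and the various signals, $\mathbf{s}_k(t;\boldsymbol{\theta}_k)$. We will denote this term by,

$$ \rho_n(\boldsymbol{\theta}_k) = \sum_{t=1}^{n} \mathbf{y}^{\dagger}(t) \mathbf{s}_k(t;\boldsymbol{\theta}_k) \tag{4.71} $$

Suppose we record $\rho_n(\boldsymbol{\theta}_k)$. At time $n+1$, when a new measurement, $\mathbf{y}(n+1)$, arrives, this term may be updated recursively as,

$$ \rho_{n+1}(\boldsymbol{\theta}_k) = \rho_n(\boldsymbol{\theta}_k) + \mathbf{y}^{\dagger}(n+1) \mathbf{s}_k(n+1;\boldsymbol{\theta}_k) \tag{4.72} $$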

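The other terms depend only on the a-priori given waveforms, $\{\mathbf{s}_k(t;\boldsymbol{\theta}_k)\}$. In many cases the expressions

$$ R_n(\boldsymbol{\theta}_l, \boldsymbol{\theta}_k) = \sum_{t=1}^{n} \mathbf{s}_l^{\dagger}(t;\boldsymbol{\theta}_l) \mathbf{s}_k(t;\boldsymbol{\theta}_k) \tag{4.73} $$

may be given for each $n$ by an a-priori analytic formula. However, even if the algorithm needs to calculate these terms explicitly, they may be calculated recursively using the following formula,

$$ R_{n+1}(\boldsymbol{\theta}_l, \boldsymbol{\theta}_k) = R_n(\boldsymbol{\theta}_l, \boldsymbol{\theta}_k) + \mathbf{s}_l^{\dagger}(n+1;\boldsymbol{\theta}_l) \mathbf{s}_k(n+1;\boldsymbol{\theta}_k) \tag{4.74} $$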

The term

$$ E_n(\boldsymbol{\theta}_k) = \sum_{t=1}^{n} \big| \mathbf{s}_k(t;\boldsymbol{\theta}_k) \big|^2 \tag{4.75} $$

which represents the total energy of the signal $\mathbf{s}_k(t)$, may also be calculated recursively or may be given by an analytic formula. In many cases this energy term takes the form of

$$ E_n(\boldsymbol{\theta}_k) = n \cdot |a_k|^2 \tag{4.76} $$

i.e. it depends only on the amplitude parameter $a_k$, and it is independent of the other parameters.
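The three recorded quantities can be maintained with one cheap update per new sample. The following sketch assumes, purely for illustration, scalar unit-amplitude complex exponentials $s(t;\omega) = e^{j\omega t}$ and a fixed pair of candidate frequencies; the class name `RunningStats` and the test data are hypothetical.

```python
import cmath

def signal(t, omega):
    """Hypothetical unit-amplitude waveform s(t; omega) = exp(j * omega * t)."""
    return cmath.exp(1j * omega * t)

class RunningStats:
    """Recursively accumulated rho_n, R_n and E_n for fixed candidate parameters."""

    def __init__(self, omega_l, omega_k):
        self.omega_l, self.omega_k = omega_l, omega_k
        self.n = 0
        self.rho = 0j  # sum_t conj(y(t)) * s_k(t), as in (4.71)
        self.R = 0j    # sum_t conj(s_l(t)) * s_k(t), as in (4.73)
        self.E = 0.0   # sum_t |s_k(t)|^2, as in (4.75)

    def update(self, y_new):
        """Fold the measurement y(n+1) into the recorded quantities."""
        self.n += 1
        s_k = signal(self.n, self.omega_k)
        s_l = signal(self.n, self.omega_l)
        self.rho += complex(y_new).conjugate() * s_k
        self.R += s_l.conjugate() * s_k
        self.E += abs(s_k) ** 2

# Compare the recursive result with a direct batch sum over t = 1, ..., 50.
y = [signal(t, 0.3) + 0.1 * t for t in range(1, 51)]
stats = RunningStats(omega_l=0.2, omega_k=0.3)
for sample in y:
    stats.update(sample)
batch_rho = sum(y[t - 1].conjugate() * signal(t, 0.3) for t in range(1, 51))
```

For this unit-amplitude waveform the energy term is simply the number of samples, which is the special case (4.76) with $|a_k| = 1$.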

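Thus, the sequential EM algorithm for superimposed, deterministic signals is given by:

• Start, n = 0: Guess $\boldsymbol{\theta}_k^{(0)}$. Initialize $\rho_0 = R_0 = E_0 = 0$.

• While data is observed

- Update the $\rho_{n+1}$'s by (4.72), the $R_{n+1}$'s by (4.74) and the $E_{n+1}$'s by (4.75).

- Update the parameters: For k = 1, 2, ..., K,

$$ \boldsymbol{\theta}_k^{(n+1)} = \arg\min_{\boldsymbol{\theta}_k} \Big\{ E_{n+1}(\boldsymbol{\theta}_k) + 2\,\mathrm{Re} \Big[ \beta_k \sum_{l=1}^{K} R_{n+1}(\boldsymbol{\theta}_l^{(n)}, \boldsymbol{\theta}_k) - R_{n+1}(\boldsymbol{\theta}_k^{(n)}, \boldsymbol{\theta}_k) - \beta_k \cdot \rho_{n+1}(\boldsymbol{\theta}_k) \Big] \Big\} \tag{4.77} $$

- Store $\rho_{n+1}$, $R_{n+1}$, $E_{n+1}$.

- n = n + 1.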


We note that, as in any exact mapping sequential EM algorithm, we can perform a few iterations for each observed data point. The advantage is that we have to update the quantities $\rho_n$, $R_n$ and $E_n$ only once for each new measurement. Sometimes it will be more efficient to perform a few more iterations before moving to the new data. However, in other cases, exhausting the previous data cannot improve the parameters; it is then more efficient to proceed and add the new data points.


4.5.2  Application  to  sum  of  sinusoids  in  noise 

The sequential EM algorithm above has been applied to the following problem. Let the observed signal be the sum of complex exponentials in noise, i.e.

$$ y(t) = \sum_{k=1}^{K} a_k e^{j\omega_k t} + n(t) \tag{4.78} $$

where $n(t)$ is a white Gaussian noise with variance $\sigma^2$. The unknown parameters are the frequencies of the complex exponentials, $\{\omega_k\}$, and the complex amplitudes, $\{a_k\}$. We note that this problem is essentially the problem of sinusoids in noise, and in this case, we write $a_k = r_k e^{j\phi_k}$, where the $\{\phi_k\}$'s are the unknown phases of the sinusoids. We will assume that the observations are given at time points $t = 0, 1, \ldots, n, \ldots$

The complete data for this deterministic superimposed signal example is the set of signals, $\{x_k(t)\}_{k=1}^{K}$, where

$$ x_k(t) = a_k e^{j\omega_k t} + n_k(t) \tag{4.79} $$

and $\{n_k(t)\}_{k=1}^{K}$ are independent white noise signals whose sum is $n(t)$. Each $n_k(t)$ has a variance $\beta_k \sigma^2$.

In this application the sequential EM algorithm presented above is further simplified. We first note that the M step of an EM iteration in this case requires solving the following maximization problems, for k = 1, ..., K

$$ \omega_k^{(n+1)} = \arg\max_{\omega} \Big| \sum_{t=0}^{n} \hat{x}_k^{(n)}(t) e^{-j\omega t} \Big|^2 = \arg\max_{\omega} \big| \hat{X}_k^{(n)}(\omega) \big|^2 \tag{4.80} $$

where $\hat{X}_k^{(n)}(\omega)$ is the Fourier Transform of the signal, $\hat{x}_k^{(n)}(t)$, estimated in the E step. The amplitude coefficients may be found either as implied by the EM iteration,

$$ a_k^{(n+1)} = \frac{1}{n+1} \sum_{t=0}^{n} \hat{x}_k^{(n)}(t) e^{-j\omega_k^{(n+1)} t} \tag{4.81} $$

- or by solving the least squares problem of (4.82), i.e.

$$ \mathbf{a}^{(n+1)} = S^{-1} \mathbf{Y}_{n+1} \tag{4.88} $$

where $\mathbf{a}$ is the vector of the amplitudes, $S$ is the matrix whose $(k, l)$ element is given by

$$ S_{k,l} = \mathrm{sinc}_{n+1}\big( \omega_k^{(n+1)} - \omega_l^{(n+1)} \big) \tag{4.89} $$

and $\mathbf{Y}_{n+1}$ is a vector whose $k$th component is $Y_{n+1}(\omega_k^{(n+1)})$.
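The frequency update (4.80) and the amplitude update implied by the EM iteration (4.81) can be sketched for a single component as follows. The maximization is carried out here by a search over a fixed frequency grid, which is an assumption of this sketch (the text does not prescribe a search method); the function names and the noise-free test signal are hypothetical.

```python
import cmath
import math

def dtft(x, omega):
    """Fourier transform of the finite record x(0), ..., x(n) at frequency omega."""
    return sum(x_t * cmath.exp(-1j * omega * t) for t, x_t in enumerate(x))

def m_step_single(x, grid):
    """Pick the grid frequency with the largest periodogram value, then
    estimate the complex amplitude as the normalized Fourier coefficient."""
    omega_hat = max(grid, key=lambda w: abs(dtft(x, w)) ** 2)
    a_hat = dtft(x, omega_hat) / len(x)
    return omega_hat, a_hat

# Noise-free test record (hypothetical): one exponential located on the grid.
N = 64
grid = [2 * math.pi * m / N for m in range(N)]
omega_true, a_true = grid[5], 0.5 + 0.5j
x = [a_true * cmath.exp(1j * omega_true * t) for t in range(N)]
omega_hat, a_hat = m_step_single(x, grid)
```

In practice a coarse grid search would typically be followed by a local refinement, since the true frequencies need not lie on the grid.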

Numerical  simulation  example 

This algorithm has been tested using the following example. The sequentially arriving observed signal, $y(t)$, is complex and discrete; it consists of three complex exponentials in additive white noise, i.e.

$$ y(t) = \sum_{k=1}^{3} a_k e^{j\omega_k t} + n(t), \qquad t = 0, 1, \ldots \tag{4.90} $$

The additive noise is spectrally flat with spectral level $\sigma^2 = 0.1$. The normalized frequencies of the complex exponentials were chosen to be,

$$ \omega_1 = 0.025, \qquad \omega_2 = 0.03, \qquad \omega_3 = 0.04 $$

The magnitudes of the complex amplitudes were chosen to be uniformly 1, and their phases chosen as,

$$ \phi_1 = 0, \qquad \phi_2 = \pi/6, \qquad \phi_3 = \pi/4 $$

We have tested the algorithm given by (4.86) and (4.88), sequentially using 500 data points. A single EM iteration has been performed for each new data point. In Figure 4.13 we have tabulated the estimates of the frequencies as a function of time. We notice that this efficient sequential algorithm correctly estimates the various frequencies after observing 120 data points. This data record is shorter than the record needed to correctly resolve these sinusoids using the standard spectral estimation methods.

Figure 4.13: Frequency estimates as a function of time


Chapter  5 


Maximum  likelihood  noise 
cancellation 


The problem of noise cancellation in single and multiple microphone environments has been extensively studied [63]. The performance of the various techniques in the single microphone case seems to be limited. However, enhancement systems with two or more microphones have been more successful due to the availability of reference signals.

In this chapter, noise cancellation based on a two sensor scenario, as indicated in Figure 5.1, is considered. One sensor (the primary microphone) measures a signal that consists of speech with noise. The second sensor (the reference microphone), located away from the speaker, measures a signal that consists mainly of the noise. The signal measured in the reference microphone is used to cancel the noise in the primary microphone. A reasonably general model for this scenario is shown in Figure 5.2.

The most widely used approach to noise cancellation, based on two microphones, was suggested by Widrow et al. [10]. In this approach, it is assumed that the system $B$ is zero and that $C$ and $D$ are identity, so that the output of the reference microphone is due only to the noise, and that the noise component in the primary microphone is the output of an unknown linear system with transfer function $A(z)$, whose input is the signal measured in the reference microphone. The coefficients of the impulse response of this system are estimated by a least-squares fitting of the reference microphone signal to the primary microphone signal. This method will be referred to later in this chapter as the least-squares method.

Figure 5.3: “Classical” noise canceling scheme

Widrow et al. proposed an adaptive solution to this least-squares problem, based on the LMS algorithm. This approach, illustrated in Figure 5.3, has been applied in a speech enhancement context, e.g. [64] and [65]. Adaptive algorithms based on the RLS algorithm also exist, e.g. [66] and [67].
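To make the classical scheme concrete, here is a minimal sketch of an LMS noise canceller under the assumptions of this approach (reference microphone contains noise only). The step size, filter length, test signals and the function name `lms_noise_canceller` are hypothetical choices for illustration, not values taken from the cited works.

```python
import math
import random

def lms_noise_canceller(primary, reference, num_taps, mu):
    """Adapt an FIR filter h so that h applied to the reference tracks the noise
    in the primary signal; the prediction error is the enhanced output."""
    h = [0.0] * num_taps
    output = []
    for t in range(len(primary)):
        # Most recent reference samples, zero-padded before t = 0.
        taps = [reference[t - k] if t - k >= 0 else 0.0 for k in range(num_taps)]
        noise_estimate = sum(hk * rk for hk, rk in zip(h, taps))
        err = primary[t] - noise_estimate  # enhanced output sample
        # LMS coefficient update with step size mu.
        h = [hk + 2 * mu * err * rk for hk, rk in zip(h, taps)]
        output.append(err)
    return h, output

# Hypothetical test: a slow sinusoid stands in for speech, white noise is filtered
# by a short FIR system a_true before reaching the primary microphone.
random.seed(0)
n = 20000
a_true = [0.8, -0.4, 0.2]
w = [random.gauss(0.0, 1.0) for _ in range(n)]
s = [0.5 * math.sin(0.05 * t) for t in range(n)]
noise = [sum(a_true[k] * w[t - k] for k in range(3) if t - k >= 0) for t in range(n)]
y1 = [s[t] + noise[t] for t in range(n)]
h, enhanced = lms_noise_canceller(y1, w, num_taps=3, mu=0.001)
```

In this idealized setting the adapted coefficients end up close to `a_true`. The sketch does not model the case where the desired signal leaks into the reference channel.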

A major limitation of the least-squares method, especially when the reference signal is correlated with the desired (speech) signal, is that a portion of the desired signal may be canceled together with the noise. Since the desired signal may be canceled with some time delay, the resulting effect is to introduce a reverberant distortion in the output.

Our approach consists of formulating the problem as a statistical maximum likelihood problem. This approach will allow us to consider a more general model that includes the effect of "cross talk", i.e. the coupling of the desired signal into the reference microphone. As in many examples throughout this thesis, solving the resulting ML problem directly is difficult, and so it is solved using the EM method. The proposed algorithm iterates between estimating the speech and the noise source signals (E step) and solving a set of linear equations for the coefficients of the acoustic impulse response (M step).

It is interesting to note that the proposed algorithm is similar to the iterative speech enhancement method for a single microphone suggested in [31]. As already noted, the iterative Wiener filter used in [31] is an instance of the EM algorithm. In that respect, the procedures developed in this chapter may be considered as extensions of the method in [31] to two microphones.

The methods presented in this chapter can be used as an alternative to the least-squares method of [10] and its derivatives, e.g. [68] and [69]. Simulation results indicate that the proposed schemes tend to eliminate the reverberant distortion encountered in the least-squares method. Adaptive versions of the proposed algorithms are also possible. We finally note that the proposed scheme can easily be extended to the more general, multiple microphone case.

This chapter is organized as follows. In section 5.1, we develop the general maximum likelihood formulation of the noise cancellation problem. In section 5.2, we apply an EM algorithm to solve the ML problem in a simplified scenario, that basically makes the same assumptions as in [10]. We then describe, in section 5.3, the EM algorithm for a more general scenario that includes "cross talk". We conclude this chapter, in section 5.4, by presenting several simulation results, including some that use a simulated realistic room impulse response.

5.1 Maximum likelihood formulation of the two-sensor noise cancellation problem

The mathematical ML formulation, encountered in a two-microphone noise cancellation problem, is based on the following scenario. A desired (speech) signal source and a noise source both exist in some acoustic environment, say a living room or an office. We have two microphones used in such a way that one microphone is intended to measure mainly the speech source, while the other is intended to measure mainly the noise source.

The desired signal and the noise are both coupled into each microphone by the acoustic field in this environment. This situation is illustrated in Figure 5.2, and is represented by the equations¹

$$ y_1(t) = C\{s(t)\} + A\{w(t)\} + e_1(t) \tag{5.1} $$

$$ y_2(t) = B\{s(t)\} + D\{w(t)\} + e_2(t) \tag{5.2} $$

where $s(t)$ denotes the desired (speech) signal and $w(t)$ denotes the noise source signal. The systems $A$, $B$, $C$ and $D$ are assumed to be linear systems, representing the acoustic transfer functions between the sources and the microphones. We will assume that these systems are time invariant in our analysis window. The additional noise sources $e_1(t)$ and $e_2(t)$ are included to represent modeling errors, microphone and measurement noise, etc.

¹ The mathematics and the algorithms will be formulated in terms of discrete time signals, with the independent variable $t$ representing normalized sample time.

Under these assumptions we may write the observed signals in the frequency domain as,

$$ Y_1(\omega) = C(\omega) S(\omega) + A(\omega) W(\omega) + E_1(\omega) \tag{5.3} $$

$$ Y_2(\omega) = B(\omega) S(\omega) + D(\omega) W(\omega) + E_2(\omega) \tag{5.4} $$

where $Y_1(\omega)$ and $Y_2(\omega)$ are the Fourier transforms of the observed signals $y_1(t)$ and $y_2(t)$, i.e.

$$ Y_i(\omega) = \frac{1}{\sqrt{T}} \sum_{t=0}^{T-1} y_i(t) e^{-j\omega t} \tag{5.5} $$
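The two-sensor model of (5.1) and (5.2) is easy to simulate when the four systems are FIR filters. The sketch below is one such simulation under that assumption; the helper names `fir` and `two_sensor_model` and the short test sequences are hypothetical.

```python
def fir(x, h):
    """Causal FIR filtering: out[t] = sum_k h[k] * x[t - k], zero initial state."""
    return [sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(x))]

def two_sensor_model(s, w, A, B, C, D, e1=None, e2=None):
    """Generate y1 = C{s} + A{w} + e1 and y2 = B{s} + D{w} + e2.

    A, B, C, D are impulse responses (lists of coefficients); e1 and e2
    default to zero when no modeling or measurement noise is supplied."""
    n = len(s)
    e1 = e1 if e1 is not None else [0.0] * n
    e2 = e2 if e2 is not None else [0.0] * n
    cs, aw = fir(s, C), fir(w, A)
    bs, dw = fir(s, B), fir(w, D)
    y1 = [cs[t] + aw[t] + e1[t] for t in range(n)]
    y2 = [bs[t] + dw[t] + e2[t] for t in range(n)]
    return y1, y2

# Classical configuration: B = 0 and C = D = identity, so y2 is the noise itself.
s = [1.0, 0.0, 0.0, 0.0]
w = [0.0, 2.0, 0.0, 0.0]
y1, y2 = two_sensor_model(s, w, A=[0.5, 0.25], B=[0.0], C=[1.0], D=[1.0])
```

Setting a nonzero `B` in the same function gives a quick way to generate test data with cross talk.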

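In the more general case of $M$ microphones and $K$ noise sources, the observed signal may be written (in the frequency domain) as,

$$ \mathbf{Y}(\omega) = \mathbf{A}(\omega) S(\omega) + \mathbf{B}(\omega) \mathbf{W}(\omega) + \mathbf{E}(\omega) \tag{5.6} $$

where $\mathbf{Y}(\omega)$, $\mathbf{A}(\omega)$ and $\mathbf{E}(\omega)$ are $1 \times M$ vectors, $\mathbf{W}(\omega)$ is a $1 \times K$ vector and $\mathbf{B}(\omega)$ is a $K \times M$ matrix.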

To formulate a statistical maximum likelihood problem, we make the following assumptions. The noise source signal, $w(t)$, is assumed to be a sample from a Gaussian random process. The desired speech signal, $s(t)$, is modeled in many cases as an AR Gaussian random process, whose parameters (the LPC parameters) are slowly time varying. For our purposes, in a short analysis window, we assume that those parameters are constant, and thus, in the mathematical formulation, the desired signal is also assumed to be a sample from a stationary AR Gaussian process. The error signals $e_1(t)$ and $e_2(t)$ are modeled as white Gaussian noise processes. The signals $s(t)$, $w(t)$, $e_1(t)$ and $e_2(t)$ are assumed to be uncorrelated.

The unknown parameters are the coefficients of the various systems and some spectral parameters of the signals. We denote the power spectra of $s(t)$ and $w(t)$ by $P_s(\omega)$ and $P_w(\omega)$ respectively, and $\sigma_1^2$, $\sigma_2^2$ will denote the error signals' variances. $A(\omega)$, $B(\omega)$, $C(\omega)$ and $D(\omega)$ are the frequency responses of the four linear systems in Figure 5.2.

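We formulate the problem in terms of short-time processing so that the signals and the system parameters can be slowly time varying; consequently, a sliding window is applied.

As already noted, the window length, $T$, must be short enough so that the parameters are constant over its duration. However, we will also assume that it is long enough so that the short-time DFT coefficients of $s(t)$, $w(t)$, $e_1(t)$ and $e_2(t)$ at different frequencies are uncorrelated. Under this assumption, the likelihood of the observations ($y_1(t)$ and $y_2(t)$) with respect to the parameters above is easily expressed in the frequency domain, and is written as (see e.g. 1:, chapter 4),

$$ \log p\big(y_1(t), y_2(t); \boldsymbol{\theta}\big) = -\sum_{\ell} \Big( \log\det \Lambda(\omega_\ell;\boldsymbol{\theta}) + \mathbf{Y}^{\dagger}(\omega_\ell) \Lambda^{-1}(\omega_\ell;\boldsymbol{\theta}) \mathbf{Y}(\omega_\ell) \Big) \tag{5.7} $$

where $\mathbf{Y}(\omega)$ is a vector whose components are $Y_1(\omega)$ and $Y_2(\omega)$. The matrix $\Lambda(\omega;\boldsymbol{\theta})$ is the power spectrum matrix, i.e.

$$ \Lambda(\omega;\boldsymbol{\theta}) = \begin{bmatrix} C(\omega) P_s(\omega) C^*(\omega) + A(\omega) P_w(\omega) A^*(\omega) + \sigma_1^2 & C(\omega) P_s(\omega) B^*(\omega) + A(\omega) P_w(\omega) D^*(\omega) \\ B(\omega) P_s(\omega) C^*(\omega) + D(\omega) P_w(\omega) A^*(\omega) & B(\omega) P_s(\omega) B^*(\omega) + D(\omega) P_w(\omega) D^*(\omega) + \sigma_2^2 \end{bmatrix} \tag{5.8} $$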

For the $M$ microphone case, the likelihood function is again (5.7), where the matrix $\Lambda$ is now the $M \times M$ power spectrum matrix $E\{\mathbf{Y}(\omega) \mathbf{Y}^{\dagger}(\omega)\}$.

The general maximum likelihood problem, represented by eqs. (5.7) and (5.8), is not only complicated but may also be ill-posed. The likelihood function depends on the parameters only through the matrix $\Lambda(\omega;\boldsymbol{\theta})$, and all possible solutions that generate the same $\Lambda(\omega;\boldsymbol{\theta})$ have the same likelihood. If indeed all the associated systems and the power spectra are unknown and their structure is allowed to be arbitrary, then we expect a non-unique solution, since any value of $\Lambda(\omega;\boldsymbol{\theta})$ may correspond to many different values for the parameters. Therefore, some constraints must be imposed on the parameters. For example, we may assume that some of the parameters are known, or that there is a simple structure to the systems. Of course, the more constraints there are, the more this ML problem becomes well-posed mathematically. However, we must limit ourselves to constraints that are consistent with the noise cancellation problem under consideration.
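One simple source of non-uniqueness is a gain ambiguity between the systems driven by the speech source and the speech spectrum. The sketch below evaluates the matrix of (5.8) at a single frequency, treating the four frequency responses as complex gains, and checks that scaling $B$ and $C$ by a complex factor while dividing $P_s$ by its squared magnitude leaves the matrix unchanged. All numerical values and the function name `spectrum_matrix` are hypothetical.

```python
def conj(z):
    """Complex conjugate that also accepts real numbers."""
    return complex(z).conjugate()

def spectrum_matrix(A, B, C, D, Ps, Pw, var1, var2):
    """2x2 power spectrum matrix of (5.8) at one frequency."""
    return [
        [C * Ps * conj(C) + A * Pw * conj(A) + var1,
         C * Ps * conj(B) + A * Pw * conj(D)],
        [B * Ps * conj(C) + D * Pw * conj(A),
         B * Ps * conj(B) + D * Pw * conj(D) + var2],
    ]

# Two different parameter sets: the second rescales B, C and P_s by alpha.
alpha = 2.0 - 1.0j
base = dict(A=0.3 + 0.1j, B=0.2j, C=1.0 + 0.5j, D=0.9,
            Ps=2.0, Pw=1.5, var1=0.1, var2=0.2)
scaled = dict(base, B=alpha * base["B"], C=alpha * base["C"],
              Ps=base["Ps"] / abs(alpha) ** 2)
L1 = spectrum_matrix(**base)
L2 = spectrum_matrix(**scaled)
max_diff = max(abs(L1[i][j] - L2[i][j]) for i in range(2) for j in range(2))
```

Since both parameter sets produce the same matrix, they have the same likelihood, which is why constraints such as fixing $C$ to unity remove this particular ambiguity.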

We will constrain the systems that represent the room acoustics to be causal, and to have a finite impulse response. Thus, for example, $A(\omega)$ is a frequency response of an FIR filter, i.e.

$$ A(\omega) = \sum_{k=0}^{q} a_k e^{-j\omega k} \tag{5.9} $$

As mentioned earlier, we will usually assume that $s(t)$, the desired signal, is a sample from an AR process of order $p$, so that its power spectrum, $P_s(\omega)$, is of the form

$P_s(\omega)$

(5.10)

In the next two sections, more specific situations will be considered, and additional constraints and assumptions, based on the additional knowledge about the underlying scenario, will be made. In both sections, the resulting ML problem is constrained enough so that it is not ill-posed.

We note that even with these assumptions and constraints, the required maximization of the likelihood function (5.7) with respect to the signal and system parameters is still complicated. Therefore the EM algorithm will be proposed for its solution. In the cases considered in the next sections, the unavailable desired signal, $s(t)$, will be a component of the chosen complete data. Thus, as a by-product of the use of the EM algorithm, an estimate of the desired signal will become explicitly available while implementing the E step.

Relation  to  array  processing  and  the  previous  chapter 

The system model, presented above and illustrated in Figure 5.2, may also model the array processing situation and the DOA estimation problem considered in the previous chapter. In the array processing case, the various systems $A$, $B$, $C$ and $D$ are simple delays, e.g. $A(\omega) = e^{-j\omega\tau}$. The additional noise sources, $e_i(t)$, will represent in this case the background noise. Maximizing the likelihood function (5.7) will be equivalent to maximizing the likelihood function developed for the Gaussian DOA estimation problem, i.e. solving (4.58).

The  EM  algorithms  suggested  in  this  chapter  are  different  and  do  not  reduce  to  the 
algorithms  presented  in  the  previous  chapter.  We  simply  choose  a  different  complete  data 
for  solving  the  same  ML  problem.  The  choice  of  complete  data  in  this  chapter  is  adequate 
for  the  noise  cancellation  problem,  since  the  resulting  EM  algorithm  generates  an  estimate 
of  the  desired  signal  in  the  E  step  and  since  in  the  case  where  an  estimate  of  a  complete 
impulse  response  and  not  just  time  delay  is  desired,  updating  the  parameters  using  this 
choice  of  complete  data  is  easier.  On  the  other  hand,  for  the  DOA  estimation  problem, 
where  the  systems  are  simple  delays,  the  choice  of  complete  data  used  in  the  previous 
chapter  has  generated  an  EM  algorithm  with  attractive  properties,  such  as  the  simple 
parallel structure.

Figure 5.4: The observations: simplified scenario

5.2  A  simplified  scenario 

In this section, a simplified version of the problem is assumed, corresponding to the restriction of $B(z)$ to be zero and both $C(z)$ and $D(z)$ to be unity in Figure 5.2, so that Figure 5.2 reduces to Figure 5.4. This scenario is assumed (at least implicitly) by Widrow et al. in [10]. In this scenario, one microphone measures the desired (speech) signal with additive noise, while the second microphone measures a reference noise signal, which is correlated with the noise component of the signal measured in the first microphone, but has no signal component present.

We will start by presenting the ML problem for this scenario. This ML problem is a
special case of (5.7). However, as we shall see, the availability of a reference signal, which
contains only noise, makes this likelihood function particularly simple. We will then present
an EM algorithm for maximizing this likelihood. The complete data will be composed of
the signal part and the noise part of the primary microphone signal, taken separately, in addition
to the reference microphone signal.

5.2.1  The  ML  problem 

As indicated in Figure 5.4, the observed signals are $y_1(t)$ and $y_2(t)$, $A(z)$ is an FIR filter,
$e(t)$ is Gaussian white noise, and $s(t)$ is the desired signal.

Specifically, then

$$y_1(t) = s(t) + n(t) \qquad (5.11)$$

where the noise component in the primary microphone is

$$n(t) = \sum_{k=0}^{q} a_k y_2(t-k) + e(t) \qquad (5.12)$$

Equivalently, equations (5.11) and (5.12) may be written as,

$$y_1(t) = s(t) + \sum_{k=0}^{q} a_k w(t-k) + e(t) \qquad (5.13)$$

$$y_2(t) = w(t) \qquad (5.14)$$

and the connection to the general scenario is now clear.

As before, we assume that the desired signal, $s(t)$, can be represented as a sample function
from a stationary Gaussian process, whose spectrum is known up to some parameters.
The unknown parameters, $\theta$, are the system coefficients, $\{a_k\}$, the spectral parameters of
$s(t)$ (which will be denoted $\phi$), and $\sigma^2$, the variance of $e(t)$.

The  likelihood  of  the  observation  is  again  expressed  in  the  frequency  domain.  This  case 
is  simpler  than  the  general  case,  discussed  in  the  previous  section.  The  likelihood  may  be 
obtained  without  referring  to  the  general  formula  of  (5.7). 

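Specifically, under the assumptions made in the previous chapter, the likelihood of the
observations may be written in the frequency domain as,

$$L(\theta) = \log f_{y_1,y_2}(y_1(t), y_2(t); \theta) = \sum_{\ell} \log f_{Y_1,Y_2}(Y_1(\omega_\ell), Y_2(\omega_\ell); \theta) \qquad (5.15)$$

Now, at each frequency

$$\log f_{Y_1,Y_2}(Y_1(\omega_\ell), Y_2(\omega_\ell); \theta) = \log f_{Y_1/Y_2}(Y_1(\omega_\ell)/Y_2(\omega_\ell); \theta) + \log f_{Y_2}(Y_2(\omega_\ell)) \qquad (5.16)$$

where $\log f_{Y_2}(Y_2(\omega_\ell))$ is independent of $\theta$. The conditional density of $Y_1(\omega_\ell)$ given $Y_2(\omega_\ell)$ is
given by

$$\log f_{Y_1/Y_2}(Y_1(\omega_\ell)/Y_2(\omega_\ell); \theta) = -\log \pi\left(P_s(\omega_\ell) + \sigma^2\right) - \frac{|Y_1(\omega_\ell) - A(\omega_\ell) Y_2(\omega_\ell)|^2}{P_s(\omega_\ell) + \sigma^2} \qquad (5.17)$$

Thus maximizing the likelihood (5.15) in this case is equivalent to minimizing,

$$\sum_{\ell} \left[ \log\left(P_s(\omega_\ell) + \sigma^2\right) + \frac{|Y_1(\omega_\ell) - A(\omega_\ell) Y_2(\omega_\ell)|^2}{P_s(\omega_\ell) + \sigma^2} \right] \qquad (5.18)$$

with respect to $\sigma^2$ and the coefficients of $P_s(\omega)$ and $A(\omega)$.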
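As an illustration, the criterion minimized in this section (the sum over frequencies of a log-variance term plus a normalized squared residual) can be evaluated directly once the DFT coefficients are available. The following is a minimal sketch, not the thesis implementation; the function name `ml_cost` and its argument layout (one list entry per frequency) are assumptions made here.

```python
import math

def ml_cost(Y1, Y2, A, Ps, sigma2):
    """Evaluate sum_l [ log(Ps_l + sigma2) + |Y1_l - A_l*Y2_l|^2 / (Ps_l + sigma2) ].

    Y1, Y2 : complex DFT coefficients of the two observed signals
    A      : frequency response of the FIR system at the same frequencies
    Ps     : signal power spectrum at the same frequencies
    sigma2 : variance of the white noise e(t)
    (All names are illustrative assumptions.)
    """
    total = 0.0
    for y1, y2, a, ps in zip(Y1, Y2, A, Ps):
        v = ps + sigma2                    # variance of Y1 given Y2
        total += math.log(v) + abs(y1 - a * y2) ** 2 / v
    return total
```

A direct minimizer would have to search over all of these parameters jointly, which is what motivates the EM approach.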

We will assume that $A(\omega)$ is the frequency response of an FIR filter of a given order $q$,
i.e. it is of the form of (5.9). Also, we will assume that $s(t)$ is an AR process of order $p$
with coefficients $\{h_i\}_{i=1}^{p}$ and gain $G$, so that its power spectrum $P_s(\omega)$ is given by (5.10).

In  some  applications,  like  LPC  vocoding  and  speech  recognition  of  noisy  data,  we  will 
be  interested  mainly  in  the  spectral  parameters  of  the  speech  signal.  In  this  case,  solving 
this  ML  problem  explicitly  provides  these  desired  parameters.  In  other  applications,  we  will 
be interested in the speech signal itself. So, using the estimated system parameters, $\{a_k\}$,
we will cancel the noise in the primary microphone and obtain an enhanced speech signal.
As mentioned above, this speech signal estimate will be available as a by-product of the EM
algorithm suggested below, while implementing the E step.

5.2.2 Solution via the EM algorithm


Direct minimization of (5.18) is complicated; therefore we consider the use of the
EM algorithm. In this approach, the complete data is chosen to be the set of signals
$\{s(t), n(t), y_2(t)\}$. This choice of complete data is motivated by the simple maximum likelihood
solution available if indeed $s(t)$, $n(t)$ and $y_2(t)$ are observed separately.

Loosely speaking, if this complete data is available, the maximum likelihood estimate of
$\{a_k\}$ and $\sigma^2$ is achieved by least-squares fitting of $y_2(t)$ to $n(t)$. The spectral parameters of
$s(t)$ are also easily estimated by solving, for example, the normal equations using the sample
correlation of $s(t)$, for LPC parameters.

More specifically, the likelihood of the complete data, $L_c(\theta)$, satisfies

$$L_c(\theta) = \log f_{s,n,y_2}(s(t), n(t), y_2(t); \theta) = \log f_{s,n/y_2}(s(t), n(t)/y_2(t); \theta) + \log f_{y_2}(y_2(t)) \qquad (5.19)$$

where $\log f_{y_2}(y_2(t))$ is independent of $\theta$. Also, given $y_2(t)$, the signals $s(t)$ and $n(t)$ are
statistically independent, and thus

$$\log f_{s,n/y_2}(s(t), n(t)/y_2(t); \theta) = \log f_{s/y_2}(s(t)/y_2(t); \theta) + \log f_{n/y_2}(n(t)/y_2(t); \theta) \qquad (5.20)$$

Now, $\log f_{n/y_2}(n(t)/y_2(t); \theta)$ depends only on $\{a_k\}$ and $\sigma^2$, and it is defined by the p.d.f.
of $e(t)$, i.e.

$$\log f_{n/y_2}(n(t)/y_2(t); \theta) = -\sum_{t=0}^{N-1} \left[ \log \sigma^2 + \frac{1}{\sigma^2} \left( n(t) - \sum_{k=0}^{q} a_k y_2(t-k) \right)^2 \right] \qquad (5.21)$$


In general, the signal $y_2(t)$ may be related to $s(t)$. However, this relation is arbitrary
and unknown. Therefore, we will assume that the probability distribution of $s(t)$ given
$y_2(t)$ is the a-priori distribution of $s(t)$. This probability distribution is the distribution of a
stationary random process with power spectrum $P_s(\omega)$ and it depends only on the spectral
parameters, $\phi$, of $s(t)$, thus,

$$\log f_{s/y_2}(s(t)/y_2(t); \theta) = \log f_s(s(t); \phi) = -\sum_{\ell} \left[ \log P_s(\omega_\ell; \phi) + \frac{|S(\omega_\ell)|^2}{P_s(\omega_\ell; \phi)} \right] \qquad (5.22)$$

where $S(\omega)$ is the Fourier transform of $s(t)$, i.e.

$$S(\omega) = \frac{1}{\sqrt{N}} \sum_{t=0}^{N-1} s(t) e^{-j\omega t}$$


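Thus, estimating $\theta$ by maximizing the likelihood of the complete data is equivalent to
estimating $\sigma^2$ and $\{a_k\}$ by minimizing

$$\frac{1}{\sigma^2} \sum_{t=0}^{N-1} \left( n(t) - \sum_{k=0}^{q} a_k y_2(t-k) \right)^2 + N \log \sigma^2 \qquad (5.23)$$

and estimating the spectral parameters $\phi$ by minimizing

$$\sum_{\ell} \left[ \log P_s(\omega_\ell; \phi) + \frac{|S(\omega_\ell)|^2}{P_s(\omega_\ell; \phi)} \right] \qquad (5.24)$$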


Note that when $s(t)$ is assumed to be an AR process, minimizing (5.24) is equivalent to
solving the Yule-Walker equations, using the sample autocorrelation of $s(t)$.
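For readers who want to see the Yule-Walker step concretely, the Levinson-Durbin recursion is one standard way to solve it. The sketch below is an illustration under stated assumptions, not the thesis procedure: the function name, the sign convention $s(t) = -\sum_i h_i s(t-i) + u(t)$, and the returned innovation variance are choices made here and may differ from the parametrization of (5.10).

```python
def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for an AR(p) model.

    r : autocorrelation values r[0], ..., r[p]
    Returns (h, g2) for the assumed convention
        s(t) = -sum_{i=1..p} h[i-1] * s(t-i) + u(t),
    where g2 is the variance of the innovation u(t).
    """
    h = []
    err = r[0]                              # zeroth-order prediction error
    for m in range(1, p + 1):
        acc = r[m] + sum(h[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / err                      # reflection coefficient
        # order update of the predictor coefficients
        h = [h[i] + k * h[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)                # updated prediction error
    return h, err
```

For example, the autocorrelation sequence `[1.0, 0.5, 0.25]` of a first-order process fitted with `p = 2` yields a second coefficient of zero, as expected.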

From eqs. (5.23) and (5.24) we note that the sufficient statistics of the complete data are
$n(t)$ and $|S(\omega_\ell)|^2$. The sufficient statistics are linear for the noise part and quadratic for the
signal part. Thus, the E step of the algorithm requires the following expectations:


$$\hat{n}(t) = E\left\{ n(t) / y_1(t), y_2(t); \theta^{(n)} \right\} \qquad (5.25)$$

and

$$M_S(\omega_\ell) = E\left\{ |S(\omega_\ell)|^2 / Y_1(\omega_\ell), Y_2(\omega_\ell); \theta^{(n)} \right\} \qquad (5.26)$$

where $\theta^{(n)}$ denotes the parameters $\{a_k\}$, $\sigma^2$ and $\phi$ in the $n$th iteration. These conditional
expectations are immediately available using the results developed for the linear Gaussian
case.

The E and the M steps of the EM algorithm for minimizing (5.18) may now be stated
explicitly. We will denote by $\theta^{(n)}$ (or by $\{a_k^{(n)}\}$, $(\sigma^2)^{(n)}$ and $P_s^{(n)}(\omega)$) the current estimate
of the parameters.


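• The E step, the $n$th iteration:

- Generate a signal $x(t)$

$$x(t) = y_1(t) - \sum_{k=0}^{q} a_k^{(n)} y_2(t-k) \qquad (5.27)$$

Note that if the true coefficients $\{a_k\}$ were known, then $x(t) = s(t) + e(t)$.

- Apply a Wiener filter to $x(t)$ to obtain the conditional expectation, or the minimum
mean square error estimate, of $s(t)$ (or $S(\omega_\ell)$) and $|S(\omega_\ell)|^2$. Specifically,
for all $\omega_\ell$ generate an estimate of $S(\omega_\ell)$, $E(\omega_\ell)$ and $|S(\omega_\ell)|^2$ as

$$\hat{S}(\omega_\ell) = \frac{P_s^{(n)}(\omega_\ell)}{P_s^{(n)}(\omega_\ell) + (\sigma^2)^{(n)}} X(\omega_\ell) \qquad (5.28)$$

$$\hat{E}(\omega_\ell) = X(\omega_\ell) - \hat{S}(\omega_\ell) \qquad (5.29)$$

$$M_S(\omega_\ell) = |\hat{S}(\omega_\ell)|^2 + \frac{P_s^{(n)}(\omega_\ell) (\sigma^2)^{(n)}}{P_s^{(n)}(\omega_\ell) + (\sigma^2)^{(n)}} \qquad (5.30)$$

where $X(\omega)$ is the Fourier transform of $x(t)$ and $E(\omega)$ is the Fourier transform
of the signal $e(t)$.

- The conditional expectation (estimate) of $n(t)$ is

$$\hat{n}(t) = \sum_{k=0}^{q} a_k^{(n)} y_2(t-k) + \hat{e}(t) \qquad (5.31)$$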

•  The  M  step,  the  nth  iteration 

Substitute  the  conditional  expectations  of  (5.30)  and  (5.31)  into  equations  (5.23)  and 
(5.24). Specifically,

- Update $\{a_k\}$ by solving the least-squares problem of (5.23) with (5.31) substituted
for $n(t)$, i.e.

$$\{a_k^{(n+1)}\} = \arg\min_{\{a_k\}} \sum_{t=0}^{N-1} \left( \sum_{k=0}^{q} \left(a_k^{(n)} - a_k\right) y_2(t-k) + \hat{e}(t) \right)^2 \qquad (5.32)$$

- Update the spectral parameters by solving (5.24) with $M_S(\omega_\ell)$ substituted for
$|S(\omega_\ell)|^2$. For LPC parameters, solve the Yule-Walker equations using the estimated
correlation values, obtained by inverse Fourier transforming $M_S(\omega)$.

Figure 5.5: The EM algorithm; simplified scenario
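The Wiener-filter part of the E step acts on each frequency independently, so it reduces to a few scalar operations per DFT coefficient. The following minimal sketch assumes the DFT of $x(t)$ has already been computed; the function name `e_step_simplified` and its list-based interface are illustrative assumptions. The second term of `M_S` is the standard error variance of a scalar Wiener estimate, $P_s \sigma^2 / (P_s + \sigma^2)$.

```python
def e_step_simplified(X, Ps, sigma2):
    """Per-frequency Wiener E step for the simplified scenario (a sketch).

    X      : complex DFT coefficients of x(t) = y1(t) - sum_k a_k y2(t-k)
    Ps     : current signal power spectrum estimate at the same frequencies
    sigma2 : current estimate of the variance of e(t)
    Returns the estimates of S, E and |S|^2 at each frequency.
    """
    S_hat, E_hat, M_S = [], [], []
    for x, ps in zip(X, Ps):
        gain = ps / (ps + sigma2)           # Wiener gain
        s = gain * x                        # conditional mean of S given X
        S_hat.append(s)
        E_hat.append(x - s)                 # conditional mean of E given X
        # second moment = |mean|^2 + conditional (error) variance
        M_S.append(abs(s) ** 2 + ps * sigma2 / (ps + sigma2))
    return S_hat, E_hat, M_S
```

Inverse transforming `E_hat` gives the time-domain estimate of $e(t)$ needed for the coefficient update, and inverse transforming `M_S` gives the correlation values for the Yule-Walker step.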

The EM algorithm above iterates until some convergence criterion is met. This algorithm
is summarized in Figure 5.5.

5.3 A more general scenario


The modeling of the two-microphone noise cancellation situation in the previous section
ignores the possible coupling of the desired signal, $s(t)$, into the reference microphone. In
the classical least-squares approach, this coupling results in a reverberant quality to the
output, because the desired signal is partially canceled together with the noise. Since
the ML problem of the previous section also ignores this coupling, the resulting EM noise
canceling algorithm has a similar problem.

In  the  ML  approach  considered  in  this  section,  this  coupling  is  taken  into  account. 
Specifically,  we  now  include  the  presence  of  the  system  B  in  Figure  5.2,  but  still  assume 
that $C \equiv 1$ and $D \equiv 1$, corresponding to the assumption that the first sensor is close to the
signal  source  and  that  the  second  sensor  is  close  to  the  noise  source.  The  resulting  model 
is  shown  in  Figure  5.6.  We  also  assume  that  A(z)  and  B(z)  are  both  FIR  systems.  These 
assumptions  are  important,  because  without  them  the  problem  is  ill-posed.  For  example,  if 
$A$, $B$, $C$ and $D$ are arbitrary, intuitively one can see that there is a symmetry to the problem,
that  precludes  the  algorithm  distinguishing  between  the  signal  and  the  noise  components 
in  each  sensor.  With  the  stated  assumptions  this  symmetry  is  removed. 

We  will  start  by  explicitly  presenting  the  ML  problem  for  this  scenario.  We  will  then 
present an EM algorithm for maximizing this likelihood, where the complete data will be
composed of the desired speech signal, $s(t)$, and the noise source signal, $w(t)$, in addition
to the observed signals $y_1(t)$ and $y_2(t)$.

Figure 5.6: The observations; more general scenario

5.3.1  The  ML  problem 

The situation assumed in this section is indicated in Figure 5.6. The mathematical
model that corresponds to this situation is given by,

$$y_1(t) = s(t) + A\{w(t)\} + e_1(t) \qquad (5.33)$$

$$y_2(t) = B\{s(t)\} + w(t) + e_2(t) \qquad (5.34)$$

where, as before, $s(t)$ is the desired signal, $w(t)$ is the noise source signal, and $e_1(t)$ and
$e_2(t)$ are the measurement and modeling error signals in the two microphones. As in the
general problem, $s(t)$ and $w(t)$ are assumed to be sample signals from Gaussian random
processes. The error signals $e_1(t)$ and $e_2(t)$ are white Gaussian noise processes. The unknown
parameters, $\theta$, are the impulse response coefficients $\{a_k\}$ and $\{b_k\}$ of the systems $A$
and $B$, the spectral parameters of the signals $s(t)$ and $w(t)$, denoted $\phi_s$ and $\phi_w$ respectively,
and the variances $\sigma_{e_1}^2$ and $\sigma_{e_2}^2$ of the noises $e_1(t)$ and $e_2(t)$.

With these assumptions, the likelihood of the observations is given again by (5.7). However,
with $C(\omega) = D(\omega) = 1$, the power spectrum matrix, $\Lambda(\omega)$, is simplified to

$$\Lambda(\omega; \theta) = E\left\{ Y(\omega) Y^{\dagger}(\omega) \right\} = \begin{bmatrix} P_s(\omega) + A(\omega) P_w(\omega) A^{*}(\omega) + \sigma_{e_1}^2 & P_s(\omega) B^{*}(\omega) + A(\omega) P_w(\omega) \\ B(\omega) P_s(\omega) + P_w(\omega) A^{*}(\omega) & B(\omega) P_s(\omega) B^{*}(\omega) + P_w(\omega) + \sigma_{e_2}^2 \end{bmatrix} \qquad (5.35)$$

We will assume again that $A(\omega)$ and $B(\omega)$ are frequency responses of FIR filters, i.e.
their structure is given by (5.9). The orders of those FIR filters are assumed to be known,
and are denoted $q_a$ and $q_b$ respectively. The desired signal is assumed to be a sample
from an AR process of a given order $p$, so that $P_s(\omega)$ will have the structure of (5.10).
We further assume that $w(t)$ is a white noise signal, i.e. $P_w(\omega)$ is constant. Even with
these assumptions, the underlying ML problem is complicated, so again we will use the EM
algorithm for its solution.

For  applications,  such  as  LPC  vocoding,  where  only  the  spectral  parameters  of  the 
speech  signals  are  required,  solving  this  ML  problem  will  explicitly  provide  these  desired 
parameters.  For  applications  where  the  speech  signal  is  required,  the  MMSE  estimate 
of  the  speech  signal  using  the  ML  estimate  of  the  parameters  will  be  suggested.  This 
MMSE estimate will be available for each current parameter value, as a by-product, while
implementing  the  E  step  of  the  suggested  EM  algorithm. 


5.3.2  Solution  via  the  EM  algorithm 

The complete data suggested for defining the EM algorithm in the current context is the
set of signals $\{s(t), w(t), y_1(t), y_2(t)\}$. The complete data is chosen this way since, if indeed
the signals $s(t)$ and $w(t)$, the inputs to the two-channel system of Figure 5.6, are observed
in addition to the signals $y_1(t)$ and $y_2(t)$, the outputs of this system, then there will be a
simple procedure for ML estimation of the parameters of this two-channel system.

Specifically, suppose that this complete data is available. To estimate the parameters
we will maximize its likelihood given by

$$L_c(\theta) = \log f_{y_1,y_2,s,w}(y_1(t), y_2(t), s(t), w(t); \theta) = \log f_{y_1,y_2/s,w}(y_1(t), y_2(t)/s(t), w(t); \theta) + \log f_{s,w}(s(t), w(t); \theta) \qquad (5.36)$$

The signals $y_1(t)$ and $y_2(t)$ are uncorrelated, given $s(t)$ and $w(t)$. The signals $s(t)$ and $w(t)$
are uncorrelated by assumption, thus,

$$L_c(\theta) = \underbrace{\log f_{y_1/s,w}(y_1(t)/s(t), w(t); \theta)}_{I} + \underbrace{\log f_{y_2/s,w}(y_2(t)/s(t), w(t); \theta)}_{II} + \underbrace{\log f_s(s(t); \theta)}_{III} + \underbrace{\log f_w(w(t); \theta)}_{IV} \qquad (5.37)$$

Term I depends only on $\{a_k\}$ and $\sigma_{e_1}^2$, and is the log probability of the sequence $e_1(t)$.
Similarly, term II depends only on $\{b_k\}$ and $\sigma_{e_2}^2$, and is the log probability of the sequence
$e_2(t)$. Term III is the log probability of the stationary signal $s(t)$ and depends only on its
spectral parameters $\phi_s$. Similarly, term IV is the log probability of the stationary signal $w(t)$
and depends only on its spectral parameters $\phi_w$. Maximizing the likelihood of the complete
data with respect to $\theta$ is equivalent to maximizing each of the terms I - IV separately with
respect to the parameters they depend on.

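Thus, given the complete data, $\phi_s$ are estimated by,

$$\hat{\phi}_s = \arg\max_{\phi_s} \log f_s(s(t); \phi_s) = \arg\min_{\phi_s} \sum_{\ell} \left[ \log P_s(\omega_\ell; \phi_s) + \frac{|S(\omega_\ell)|^2}{P_s(\omega_\ell; \phi_s)} \right] \qquad (5.38)$$

and $\phi_w$ are estimated by,

$$\hat{\phi}_w = \arg\max_{\phi_w} \log f_w(w(t); \phi_w) = \arg\min_{\phi_w} \sum_{\ell} \left[ \log P_w(\omega_\ell; \phi_w) + \frac{|W(\omega_\ell)|^2}{P_w(\omega_\ell; \phi_w)} \right] \qquad (5.39)$$

where $S(\omega)$ and $W(\omega)$ are the Fourier transforms of $s(t)$ and $w(t)$ respectively, i.e.

$$S(\omega) = \frac{1}{\sqrt{N}} \sum_{t=0}^{N-1} s(t) e^{-j\omega t}$$

$$W(\omega) = \frac{1}{\sqrt{N}} \sum_{t=0}^{N-1} w(t) e^{-j\omega t}$$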

The maximization in (5.38) is sometimes simpler, e.g. when $s(t)$ is assumed to be an AR
process, in which case solving (5.38) is equivalent to solving the Yule-Walker equations,
using the sample autocorrelation of $s(t)$. Similarly, solving (5.39) is sometimes simpler. If
$w(t)$, the noise source signal, is a white noise signal, then solving (5.39) is equivalent to
finding the (constant) spectrum level, $P_w$, by,

$$\hat{P}_w = \frac{1}{N} \sum_{\ell} |W(\omega_\ell)|^2 \qquad (5.40)$$
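The white-noise case can be checked in one line. The short derivation below assumes that the sum in (5.39) runs over $N$ frequencies $\omega_\ell$ (this count is an assumption made here) and that $P_w(\omega_\ell) = P_w$ is the same constant at every frequency.

```latex
\frac{\partial}{\partial P_w} \sum_{\ell=0}^{N-1}
  \left[ \log P_w + \frac{|W(\omega_\ell)|^2}{P_w} \right]
  = \frac{N}{P_w} - \frac{1}{P_w^2} \sum_{\ell=0}^{N-1} |W(\omega_\ell)|^2 = 0
\quad \Longrightarrow \quad
\hat{P}_w = \frac{1}{N} \sum_{\ell=0}^{N-1} |W(\omega_\ell)|^2
```

In words, the ML spectrum level of a white signal is the average of its periodogram over frequency.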
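Estimating the impulse response coefficients, $\{a_k\}$, and the variance, $\sigma_{e_1}^2$, given the
complete data, requires solving a least-squares problem, since

$$\hat{\sigma}_{e_1}^2, \{\hat{a}_k\} = \arg\max \log f_{y_1/s,w}(y_1(t)/s(t), w(t); a_0, \ldots, a_{q_a}, \sigma_{e_1}^2)$$

$$= \arg\min \frac{1}{\sigma_{e_1}^2} \sum_{t=0}^{N-1} \left( y_1(t) - s(t) - \sum_{k=0}^{q_a} a_k w(t-k) \right)^2 + N \log \sigma_{e_1}^2 \qquad (5.41)$$

Similarly, estimating $\{b_k\}$ and $\sigma_{e_2}^2$ given the complete data requires solving the following
least-squares problem,

$$\hat{\sigma}_{e_2}^2, \{\hat{b}_k\} = \arg\max \log f_{y_2/s,w}(y_2(t)/s(t), w(t); b_0, \ldots, b_{q_b}, \sigma_{e_2}^2)$$

$$= \arg\min \frac{1}{\sigma_{e_2}^2} \sum_{t=0}^{N-1} \left( y_2(t) - w(t) - \sum_{k=0}^{q_b} b_k s(t-k) \right)^2 + N \log \sigma_{e_2}^2 \qquad (5.42)$$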

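The explicit solution of the least-squares problems, implied by equations (5.41) and
(5.42), is achieved by solving the following "normal" linear equations:

$$R_w \cdot a = r_a \qquad (5.43)$$

$$R_s \cdot b = r_b \qquad (5.44)$$

where $R_w$ is the correlation matrix of $w(t)$ of order $q_a$, i.e. its $(j,k)$ element is $r_w(j-k)$ for
$j, k = 0, \ldots, q_a$, and $R_s$ is the corresponding correlation matrix of $s(t)$ of order $q_b$. The $j$th
element of $r_a$ is the sample cross-correlation between $y_1(t) - s(t)$ and $w(t-j)$, the $j$th element
of $r_b$ is the sample cross-correlation between $y_2(t) - w(t)$ and $s(t-j)$,
and the vectors $a$ and $b$ are the unknown impulse response coefficients of the systems $A$ and
$B$.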
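Both least-squares problems share one structure: fit a target signal $d(t)$ by an FIR-filtered input, then take the mean squared residual as the variance estimate. The sketch below forms and solves the normal equations directly from the signals; it is an illustration only. The names `solve_linear` and `fit_fir`, the zero initial conditions for $t < 0$, and the use of plain Gaussian elimination are assumptions made here. For the system $A$, use $d(t) = y_1(t) - s(t)$ with input $w(t)$; for the system $B$, use $d(t) = y_2(t) - w(t)$ with input $s(t)$.

```python
def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_fir(d, u, q):
    """Least-squares fit of d(t) ~ sum_{k=0..q} c_k u(t-k).

    Returns the coefficients c and the residual variance (ML noise variance).
    The input u(t) is taken as zero for t < 0 (an assumption of this sketch).
    """
    N = len(d)

    def ud(t):
        return u[t] if t >= 0 else 0.0

    # sample correlation matrix of the input and cross-correlation vector
    R = [[sum(ud(t - j) * ud(t - k) for t in range(N)) / N
          for k in range(q + 1)] for j in range(q + 1)]
    r = [sum(d[t] * ud(t - j) for t in range(N)) / N for j in range(q + 1)]
    c = solve_linear(R, r)
    resid = [d[t] - sum(c[k] * ud(t - k) for k in range(q + 1)) for t in range(N)]
    return c, sum(e * e for e in resid) / N
```

In the EM iteration the exact correlations in `R` and `r` are replaced by their conditional expectations from the E step.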

By observing the required procedures for maximizing the likelihood of the complete
data, i.e. equations (5.38), (5.39) and (5.43), (5.44), we see that the sufficient statistics
of the complete data contain quadratic terms, which are the sample autocorrelation (or
the sample spectrum) and the sample cross correlation (or cross spectrum) of the various
signals, in addition to the linear terms (i.e. the signals themselves). Thus, the E step of
the algorithm (with the above choice of complete data) requires the expectations:

$$\hat{s}(t) = E\left\{ s(t) / y_1(t), y_2(t); \theta^{(n)} \right\} \qquad (5.48)$$

$$\hat{w}(t) = E\left\{ w(t) / y_1(t), y_2(t); \theta^{(n)} \right\} \qquad (5.49)$$

and the quadratic terms:

$$\hat{r}_s(k) = E\left\{ r_s(k) / y_1(t), y_2(t); \theta^{(n)} \right\} \qquad (5.50)$$

$$\hat{r}_w(k) = E\left\{ r_w(k) / y_1(t), y_2(t); \theta^{(n)} \right\} \qquad (5.51)$$

$$\hat{r}_{sw}(k) = E\left\{ r_{sw}(k) / y_1(t), y_2(t); \theta^{(n)} \right\} = \hat{r}_{ws}(-k) \qquad (5.52)$$

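We will implement the E step in the frequency domain, since for stationary processes
with large observation time, the DFT coefficients at each frequency are statistically independent
and can be processed separately. In each frequency $\omega_\ell$ the observation may be
written as

$$\begin{bmatrix} Y_1(\omega_\ell) \\ Y_2(\omega_\ell) \end{bmatrix} = \begin{bmatrix} 1 & A(\omega_\ell) \\ B(\omega_\ell) & 1 \end{bmatrix} \cdot \begin{bmatrix} S(\omega_\ell) \\ W(\omega_\ell) \end{bmatrix} + \begin{bmatrix} E_1(\omega_\ell) \\ E_2(\omega_\ell) \end{bmatrix} \qquad (5.53)$$

The E step requires the conditional expectation of $S(\omega_\ell)$, $W(\omega_\ell)$, $|S(\omega_\ell)|^2$, $|W(\omega_\ell)|^2$ and
$S(\omega_\ell) W^{*}(\omega_\ell)$.

At each step of the algorithm, the current values of the parameters are used. We
will denote by $A^{(n)}(\omega)$ and $B^{(n)}(\omega)$ the current estimate of the frequency responses of the
unknown systems $A$ and $B$, and by $P_s^{(n)}(\omega)$ and $P_w^{(n)}(\omega)$ the current estimate of the power
spectra of $s(t)$ and $w(t)$. Let $H(\omega_\ell)$ denote the matrix

$$H(\omega_\ell) = \begin{bmatrix} 1 & A^{(n)}(\omega_\ell) \\ B^{(n)}(\omega_\ell) & 1 \end{bmatrix} \qquad (5.54)$$

and let $\Phi(\omega_\ell)$ and $\Sigma$ denote the power spectra matrices

$$\Phi(\omega_\ell) = \begin{bmatrix} P_s^{(n)}(\omega_\ell) & 0 \\ 0 & P_w^{(n)}(\omega_\ell) \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_{e_1}^2 & 0 \\ 0 & \sigma_{e_2}^2 \end{bmatrix} \qquad (5.55)$$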
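The required conditional expectations are readily available, since again this is a linear
Gaussian case. These conditional estimates may be interpreted as performing a two-channel
Wiener filter (see [20]) and calculating its error covariance matrix. Thus, the estimate of
the linear terms is given by

$$\begin{bmatrix} \hat{S}(\omega_\ell) \\ \hat{W}(\omega_\ell) \end{bmatrix} = K(\omega_\ell) \cdot \begin{bmatrix} Y_1(\omega_\ell) \\ Y_2(\omega_\ell) \end{bmatrix} \qquad (5.56)$$

where $K(\omega_\ell)$ is the matrix

$$K(\omega_\ell) = \Phi(\omega_\ell) \cdot H(\omega_\ell)^{\dagger} \left( H(\omega_\ell) \cdot \Phi(\omega_\ell) \cdot H(\omega_\ell)^{\dagger} + \Sigma \right)^{-1} \qquad (5.57)$$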

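For the quadratic terms, we have to calculate the error covariance matrix of this Wiener
filter, i.e.

$$P(\omega_\ell) = \left( \Phi^{-1}(\omega_\ell) + H(\omega_\ell)^{\dagger} \cdot \Sigma^{-1} \cdot H(\omega_\ell) \right)^{-1}$$

$$= \Phi(\omega_\ell) - \Phi(\omega_\ell) H(\omega_\ell)^{\dagger} \left( H(\omega_\ell) \cdot \Phi(\omega_\ell) \cdot H(\omega_\ell)^{\dagger} + \Sigma \right)^{-1} H(\omega_\ell) \Phi(\omega_\ell) \qquad (5.58)$$

and the quadratic terms are obtained by,

$$M_S(\omega_\ell) = E\left\{ |S(\omega_\ell)|^2 / Y_1(\omega_\ell), Y_2(\omega_\ell) \right\} = |\hat{S}(\omega_\ell)|^2 + P_{11}(\omega_\ell) \qquad (5.59)$$

$$M_W(\omega_\ell) = E\left\{ |W(\omega_\ell)|^2 / Y_1(\omega_\ell), Y_2(\omega_\ell) \right\} = |\hat{W}(\omega_\ell)|^2 + P_{22}(\omega_\ell) \qquad (5.60)$$

$$M_{SW}(\omega_\ell) = E\left\{ S(\omega_\ell) W^{*}(\omega_\ell) / Y_1(\omega_\ell), Y_2(\omega_\ell) \right\} = \hat{S}(\omega_\ell) \hat{W}^{*}(\omega_\ell) + P_{12}(\omega_\ell) \qquad (5.61)$$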
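At a single frequency the two-channel Wiener filter involves only 2x2 complex matrices, so it can be written out with elementary helpers. The sketch below is illustrative rather than the thesis code: the helper names, the argument order, and the choice to return the full error covariance matrix are assumptions made here. It computes the gain as $\Phi H^{\dagger} (H \Phi H^{\dagger} + \Sigma)^{-1}$ and the error covariance as $\Phi - K H \Phi$.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(a, b, sign=1):
    return [[a[i][j] + sign * b[i][j] for j in range(2)] for i in range(2)]

def herm(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def inv(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def two_channel_e_step(Y, A, B, Ps, Pw, var1, var2):
    """E step at one frequency. Y = [Y1, Y2]; A, B are frequency responses;
    Ps, Pw are power spectra; var1, var2 are the two noise variances."""
    H = [[1 + 0j, A], [B, 1 + 0j]]
    Phi = [[Ps + 0j, 0j], [0j, Pw + 0j]]
    Sig = [[var1 + 0j, 0j], [0j, var2 + 0j]]
    Hh = herm(H)
    # Wiener gain K = Phi H^H (H Phi H^H + Sigma)^-1
    K = mat_mul(mat_mul(Phi, Hh), inv(mat_add(mat_mul(mat_mul(H, Phi), Hh), Sig)))
    # error covariance P = Phi - K H Phi
    P = mat_add(Phi, mat_mul(mat_mul(K, H), Phi), sign=-1)
    S_hat = K[0][0] * Y[0] + K[0][1] * Y[1]
    W_hat = K[1][0] * Y[0] + K[1][1] * Y[1]
    M_S = abs(S_hat) ** 2 + P[0][0].real
    M_W = abs(W_hat) ** 2 + P[1][1].real
    M_SW = S_hat * W_hat.conjugate() + P[0][1]
    return S_hat, W_hat, M_S, M_W, M_SW, P
```

With $A = B = 0$ the two channels decouple and the result reduces to two independent scalar Wiener filters, which is a convenient sanity check.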

The  E  and  M  steps  of  the  EM  algorithm  for  maximizing  the  likelihood  of  the  observations 
(given  by  (5.7)  and  (5.35))  for  this  more  general  case  may  now  be  stated  explicitly. 

• The E step, the $n$th iteration:

- Calculate the conditional expectations $\hat{S}(\omega_\ell)$ and $\hat{W}(\omega_\ell)$ by (5.56).

- Calculate $M_S(\omega_\ell)$, $M_W(\omega_\ell)$ and $M_{SW}(\omega_\ell)$ by (5.59)-(5.61).

- The signal estimates $\hat{s}(t)$ and $\hat{w}(t)$, and the correlation estimates $\hat{r}_s(k)$, $\hat{r}_w(k)$ and
$\hat{r}_{sw}(k)$ are achieved by inverse Fourier transforming $\hat{S}(\omega)$, $\hat{W}(\omega)$, $M_S(\omega)$, $M_W(\omega)$
and $M_{SW}(\omega)$ respectively.

• The M step, the $n$th iteration:

- Solve the linear equations of (5.43) and (5.44) for $a$ and $b$, using the estimates
$\hat{r}_s(k)$, $\hat{r}_w(k)$ and $\hat{r}_{sw}(k)$ from the E step, and with

$$\hat{r}_{wy_1}(k) = \frac{1}{N} \sum_{t=0}^{N-1} \hat{w}(t) y_1(t-k)$$

$$\hat{r}_{sy_2}(k) = \frac{1}{N} \sum_{t=0}^{N-1} \hat{s}(t) y_2(t-k) \qquad (5.62)$$

The result is the updated coefficients $a^{(n+1)}$ and $b^{(n+1)}$ of the systems $A$ and $B$.

- Update the spectral parameter estimate, by solving (5.38) and (5.39), using
$M_S(\omega_\ell)$ and $M_W(\omega_\ell)$ instead of $|S(\omega_\ell)|^2$ and $|W(\omega_\ell)|^2$. For LPC parameters of
the speech signal $s(t)$, solve the Yule-Walker equations, using $\hat{r}_s(k)$.

Figure 5.7: The EM algorithm; more general scenario

The EM algorithm above iterates until some convergence criterion is met. This algorithm
is summarized in Figure 5.7.

The procedures suggested in this section, and also in the previous section, are implemented
in each iteration on the entire data. Adaptive and sequential procedures, based
on the discussion in chapter 3, are also possible. These algorithms may process new data
in each new iteration. The parameters will be updated according to one of the suggested
strategies in chapter 3, and a new segment of enhanced signal will be produced.

Examining the suggested procedure illustrated in Figure 5.7, a sequential EM algorithm
based on recursive E and M steps comes to mind. The Wiener filter of the E step will
be replaced by the sequential Kalman filter, and the linear least-squares problems of the
M step will be solved via a sequential RLS type algorithm. The details, the analysis and
experiments with this adaptive version are now under investigation and are the subject of
further research.

5.4  Experimental  results 

The  algorithms  developed  in  this  chapter  for  both  the  simplified  scenario  and  the 
more  general  scenario  have  been  applied  and  tested  in  a  simulated  environment.  A  realistic 
acoustic  environment  has  been  created  by  generating  the  impulse  response  coefficients  of  the 
systems, representing the room acoustics, using a well tested acoustic simulation program
[70]. In this section we will discuss the results of our simulation experiments.

5.4.1  The  simplified  scenario 

The simplified scenario of Figure 5.4 has been implemented with $s(t)$, a speech signal,
and a band-limited noise signal with a flat spectrum from zero to 3 kHz. The FIR
filter, $A(z)$, was of order 10. $y_1(t)$ was generated according to Figure 5.4. The SNR in
$y_1(t)$ was approximately -20 dB. The results were compared with a "batch" version of the
least-squares algorithm, corresponding to estimating the $\{a_k\}$'s via the least-squares problem

$$\min_{\{a_k\}} \sum_{t} \left( y_1(t) - \sum_{k=1}^{q} a_k y_2(t-k) \right)^2$$

and then canceling the noise and estimating the signal as

$$\hat{s}(t) = y_1(t) - \sum_{k=1}^{q} \hat{a}_k y_2(t-k)$$

Figure 5.8: Correlation between the reference and desired signals. $C(z) = 0.1 z^{-5}$

Both  algorithms  produced  good  enhancement  of  the  speech  signal,  and  although  there 
were  perceptible  differences,  the  overall  quality  of  both  was  similar. 

The direct least-squares approach assumes that $y_2(t)$ and $s(t)$ are uncorrelated. This
assumption is critical. Our algorithm, on the other hand, does not require this assumption.
In a second experiment, $y_2(t)$ included a delayed version of the speech signal, as illustrated
in Figure 5.8. (Note that this scheme is different from the scheme considered in the more
general scenario, since we have a direct measurement of the input to the system $A(z)$.)

In this experiment, the SNR in $y_1(t)$ was again -20 dB. The direct least-squares approach
canceled part of the signal together with the noise, resulting in poor quality. In comparison,
the performance of our algorithm was still good.

Figure 5.9: The living room acoustic environment

5.4.2  The  more  general  scenario 

The more general scenario assumed in Figure 5.6 was simulated, where again $s(t)$ was a
speech signal and $w(t)$ was a white noise signal. In order to simulate a realistic scenario, we
assumed a living room environment with the signal and noise sources located as illustrated
in Figure 5.9. We used a simulation program developed by Peterson ([70]), and we generated
FIR impulse responses having 2000 coefficients for each of the systems $A$ and $B$. The first
500 coefficients of these impulse responses are plotted in Figure 5.10. By monitoring the
level of the noise source, we have generated examples in which the SNR of $y_1(t)$ was +20 dB,
0 dB and -20 dB.

We implemented the EM algorithm described in Figure 5.7 and compared the results
to the least-squares method, by informal listening. Both algorithms estimated up to 500
coefficients of the impulse response. At all SNR levels our algorithm performed better, and
its output, unlike the least-squares output, was reverberation free.

Figure 5.10: Simulated "room acoustics" impulse responses.

At high SNR (+20 dB), the output of the least-squares method sounded worse
than the unprocessed measurement signal, due to the signal canceling effect. The output of
our method sounded better than the original measurement signal.

At 0 dB, the least-squares output sounded better than the measurement signal. However,
it sounded much worse than the output of our algorithm, which at this SNR level generated
an almost clean signal.

At -20 dB SNR, the output of the ML method sounded better than the least-squares
method. However, the distinction between the two was not as significant as in the case of
0 dB SNR. This is perhaps a result of the fact that, in order to generate a low SNR, we
increased the level of the noise source. This resulted in a high Noise to Signal Ratio in the
reference microphone, which in turn resulted in lower signal cancellation, since the situation
became closer to that assumed by the least-squares method.


Chapter  6 


Information,  Relative  Entropy 
and  the  EM  algorithm 


At  this  point  in  the  thesis,  we  have  already  presented  the  main  results  and  contributions. 
The  class  of  iterative  and  sequential  algorithms  has  been  presented  and  motivated.  We  have 
also demonstrated several applications of these algorithms to real-world signal processing
problems.  In  the  course  of  the  thesis,  several  subjects,  related  to  information  and  the 
philosophical  essence  of  the  inference  process,  have  been  mentioned  briefly.  We  want  to 
use  the  opportunity  of  this  chapter  to  discuss  these  issues  further.  We  will  consider  in  this 
chapter  some  interesting  topics,  in  the  context  of  information  theory,  that  are  related  to 
the  EM  method  and  to  the  notion  of  complete  and  incomplete  data  relations. 

Information measures and statistical inference criteria are closely related. The books by
Kullback [71] and Pinsker [72] are devoted to information theory and statistics. The Minimum
Description Length (MDL) and the Minimum Information (MI) criteria, mentioned in
Chapter 2, provide examples of the application of information measures to inference problems.
Another criterion, based on information theory, the Maximum Entropy (ME) or its
generalization the Minimum Relative Entropy (MRE), is the main tool used in [71] to relate
information theory to statistics. It will be interesting to show that the ME/MRE criterion
is a special case of the MDL criterion in a special complete-incomplete data context.

Another topic, discussed in this chapter, is the alternative derivation of the EM algorithm
using the MRE criterion, which has been suggested by Musicus in [7] and [8], and by
Csiszar et al. [73]. This derivation is presented with our interpretation and comments. In
the context of this chapter, where general information criteria are analyzed, this derivation
may be viewed in the right perspective.

This chapter is not organized as a coherent theory, but rather as "variations on the themes"
above. The common "motive" is the relation to the EM method. We start our presentation
with the ME or the MRE criterion, point out that its philosophy is distinct from the philosophy
of the ML and other statistical criteria and give its common justifications. We then
show that the MRE criterion sometimes reduces to the ML criterion, and, in these cases,
its minimization using the alternate minimization algorithm reduces to the EM algorithm.
However, we will raise some doubt concerning the rationale behind using the Minimum Relative
Entropy in some contexts, including the context that led to the alternative derivation
of the EM algorithm. The chapter ends with an important result: we prove that the ME or
MRE criterion can be viewed as an interesting implementation of the MI/MDL ideas in a
special complete/incomplete data situation.

6.1 Maximum Entropy and Minimum Relative Entropy


So far in the thesis we have presented several methods and criteria for statistical inference.
Maximum Entropy and Minimum Relative Entropy methods flow from a different
philosophy, however. This philosophy will be presented and compared to the philosophy of
other statistical inference criteria.


6.1.1 ME and MRE in comparison to other statistical criteria

The "output" of every statistical inference method is a choice of probability function,
which we believe, according to the specific criterion we use, best represents the behavior of
the phenomena that we observe. However, ME or MRE methods consider the observations
in a unique manner and accept only certain forms of data.

To be more specific, let us describe the common situation that leads to the application
of ME or MRE methods. As mentioned above, the goal of any inference process is to find
a p.d.f., p(x), defined over the sample space, X, of the observed phenomena. Suppose we
know that p(·) belongs to a set P, where this set is usually defined by the knowledge of some
averages,

$$P = \{\, p(x) : E_p[g(x)] = \bar{g} \,\} \qquad (6.1)$$


The given averages are the only information observed from the underlying phenomena in
the ME framework. The choice of probability function is then made by,

$$\hat{p} = \arg\max_{p \in P} H(p) = \arg\max_{p \in P} \left[ -\int_X p(x) \log p(x)\, dx \right] \qquad (6.2)$$


In the MRE framework, an a-priori probability, q(·), for the observed phenomena is also
given, and thus the choice of probability function is made by,

$$\hat{p} = \arg\min_{p \in P} H(p;q) = \arg\min_{p \in P} \int_X p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (6.3)$$

We immediately note that the MRE criterion reduces to the ME criterion if q(·) is the
uniform  prior.  Therefore,  in  the  rest  of  the  chapter,  we  will  discuss  only  the  MRE  method; 
all  the  results  will  also  apply  to  the  ME  method. 
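
As an illustration of (6.1)-(6.3) on a finite sample space, the following Python sketch computes the MRE distribution under a single mean constraint. It relies on the standard fact that, for a linear constraint of the form $E_p[g(x)] = \bar{g}$, the minimizer of (6.3) is an exponentially tilted prior, $p_i \propto q_i e^{\lambda g_i}$. This closed form, the bisection search for $\lambda$, and the names `mre_mean_constraint`, `g` and `g_bar` are assumptions made for the illustration; they are not taken from the thesis.

```python
import math

def mre_mean_constraint(q, g, g_bar, lo=-50.0, hi=50.0, tol=1e-12):
    """Minimize sum_i p_i log(p_i/q_i) subject to sum_i p_i g_i = g_bar.

    Assumes the solution is the tilted prior p_i ~ q_i * exp(lam * g_i)
    and finds lam by bisection (the tilted mean is increasing in lam).
    """
    def tilted(lam):
        exponents = [lam * gi for gi in g]
        top = max(exponents)  # subtracted for numerical stability
        w = [qi * math.exp(e - top) for qi, e in zip(q, exponents)]
        total = sum(w)
        return [wi / total for wi in w]

    def mean(p):
        return sum(pi * gi for pi, gi in zip(p, g))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid)) < g_bar:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

def relative_entropy(p, q):
    """H(p; q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative data: a six-sided die with a uniform prior, so that the MRE
# solution coincides with the ME solution, constrained to have mean 4.5.
faces = [1, 2, 3, 4, 5, 6]
uniform = [1.0 / 6] * 6
p_hat = mre_mean_constraint(uniform, faces, g_bar=4.5)
print([round(p, 4) for p in p_hat], relative_entropy(p_hat, uniform))
```

When `g_bar` equals the prior mean (3.5 here), the constraint does not rule out the prior and the sketch returns the prior itself, in line with the rationale of Section 6.1.2.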

The  basic  limitation  of  the  MRE  method  is  that  one  cannot  incorporate  the  information 
provided  by  a  specific  observation  sequence.  The  only  information  that  can  be  incorporated 
is in the form of given averages or other constraints on the probability function.
Nevertheless, the MRE method is sometimes used when a specific observation sequence is given. In
this  case,  one  usually  calculates  some  sample  averages  and  uses  them  as  constraints  on  the 
relative  entropy  minimization.  This  approach  is  certainly  not  an  optimal  one,  since  not  all 
available  information  about  the  phenomena  is  incorporated  and  since  errors  are  introduced 
in  the  inference  process,  because  the  sample  averages  differ  from  the  statistical  averages. 

The other statistical inference methods consider the observed data directly. For example,
in the Maximum Likelihood framework, we have an observation $x \in X$ and we assume that
the probability distribution that describes x is characterized by some unknown parameter
vector, $\theta \in \Theta$. The ML criterion will choose p by choosing $\theta \in \Theta$ according to,

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \log p(x;\theta) \qquad (6.4)$$

The basic limitation of the ML is the need for modeling assumptions. Without those
restrictions the method will break down; for example, if we allow any probability function,
maximizing the likelihood will lead to the trivial (but unacceptable) result
$p(\alpha) = \delta(\alpha - x)$, where $\delta(\cdot)$ is the Dirac delta function. On the other hand, in the
ME/MRE framework, we do not assume any model; the method will work even if the set P in
(6.2) and (6.3) contains all possible probability functions. The constraints on the
probability functions are derived from the data.

To  summarize,  the  weakness  of  the  MRE  method  is  the  unique  and  restricted  manner 
in  which  it  can  accept  input.  The  MRE  method  can  consider  the  data  only  in  the  form  of 
averages.  Its  strength,  on  the  other  hand,  lies  in  the  facts  that  no  modeling  assumptions 
are  needed  and  that  a  full  probability  function  is  estimated.  There  are  situations  in  which 
giving the input in the form of averages is natural; for example, in statistical mechanics, the
observations at the macroscopic level are indeed some averages of a stochastic phenomenon
that occurs at the microscopic level. In these cases, following the rationale of the MRE
method given below, the usage of the MRE method is justified.

6.1.2  The  rationale  of  the  MRE  method 

The commonly used rationale for justifying the Maximum Entropy method is advocated
mainly by Jaynes [74] and [75]. Here, we briefly repeat this rationale.

Suppose that the sample space of the underlying phenomena is discrete and finite,
i.e. the random variable, X, whose instances, x, we may observe, takes its values over
the finite set $\{1, \ldots, m\}$. This random variable has an a-priori probability assignment,
$\{q_1, \ldots, q_m\}$. Suppose we observe an infinite i.i.d. sequence, $\{x_1, \ldots, x_n, \ldots\}$, of realizations
of X. A question we might ask is what will the sample frequencies (or histogram) of this
sequence typically be? Naturally, by the strong law of large numbers, the answer is the
a-priori probability, $\{q_i\}_{i=1}^{m}$. However, suppose we have additional information in the form
of constraints on the possible histograms, maybe a knowledge of some averages, that rules
out the a-priori assignment. In this case, we will compare the probabilities of all possible
histograms; the typical histogram will be the one with the highest probability.


Specifically, consider first the case of a finite sequence, $x = \{x_1, \ldots, x_N\}$, of i.i.d.
realizations of X. With the given probability assignment, $\{q_i\}_{i=1}^{m}$, on X, the probability of
getting a specific sequence, whose sample frequencies are $\{p_i\}_{i=1}^{m}$, where $p_i = k_i/N$ and $k_i$
is the number of times the outcome i appeared in x, is

$$p(x) = \prod_{i=1}^{m} q_i^{k_i} = \prod_{i=1}^{m} q_i^{N p_i} \qquad (6.5)$$

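There are, however, $\frac{N!}{k_1! \cdots k_m!}$ sequences with these sample frequencies. Thus, the probability
of the event “the sample frequencies of x are $\{p_i\}$”, denoted Prob(p/q), is given by,

$$\mathrm{Prob}(p/q) = \frac{N!}{k_1! \cdots k_m!} \prod_{i=1}^{m} q_i^{k_i} = \frac{N!}{k_1! \cdots k_m!} \prod_{i=1}^{m} q_i^{N p_i} \qquad (6.6)$$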

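It is easy to prove, using Stirling’s formula for the factorial, that

$$\frac{N!}{k_1! \cdots k_m!} \approx e^{N H(p)} \qquad (6.7)$$

where $H(p) = -\sum_{i=1}^{m} p_i \log p_i$ is the entropy associated with the frequencies $p_i = k_i/N$.
Thus, as $N \to \infty$ equation (6.6) becomes

$$\mathrm{Prob}(p/q) \approx e^{N H(p)} \prod_{i=1}^{m} q_i^{N p_i} = e^{-N \sum_{i=1}^{m} p_i \log (p_i/q_i)} \qquad (6.8)$$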


or equivalently,

$$\log \mathrm{Prob}(p/q) \approx -N \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i} = -N \cdot H(p;q) \qquad (6.9)$$

where $H(p;q)$ is the relative entropy between $\{p_i\}_{i=1}^{m}$ and $\{q_i\}_{i=1}^{m}$.

From equation (6.9) we see that the relative entropy is directly proportional to $-\log \mathrm{Prob}(p/q)$.
Thus the histogram with the highest probability is the one that minimizes the relative
entropy. This histogram is also the typical histogram in the following sense. Consider the
histogram $p$ that minimizes the relative entropy and any other histogram $p'$ with higher
relative entropy. We claim that the probability of $p$ is overwhelmingly larger than the
probability of $p'$, since, as $N \to \infty$, the ratio between these probabilities goes exponentially fast
to infinity,

$$\frac{\mathrm{Prob}(p/q)}{\mathrm{Prob}(p'/q)} \approx e^{N\alpha}, \qquad \alpha = H(p';q) - H(p;q) > 0 \qquad (6.10)$$


This  concludes  the  justification  of  the  minimum  relative  entropy  criterion. 

(The argument above, the probability calculations and the term “typical sequence” are
frequently used in Shannon's development of information theory [24], especially for his
source coding theorem.)
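
The asymptotic statement in (6.6)-(6.9) can be checked numerically. The sketch below evaluates the exact logarithm of (6.6) with the log-gamma function and compares it, per sample, with $-H(p;q)$. The prior `q` and the histogram `(0.2, 0.3, 0.5)` are arbitrary illustrative values, and the function names are assumptions of this sketch.

```python
import math

def log_prob_histogram(counts, q):
    """Exact log of (6.6): log[ N!/(k_1! ... k_m!) * prod_i q_i^k_i ]."""
    n = sum(counts)
    log_multinomial = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    return log_multinomial + sum(k * math.log(qi) for k, qi in zip(counts, q) if k > 0)

def relative_entropy(p, q):
    """H(p; q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.5, 0.3, 0.2]   # a-priori assignment (illustrative values)
base = [2, 3, 5]      # counts proportional to the histogram p = (0.2, 0.3, 0.5)

gaps = []
for scale in (10, 100, 1000, 10000):
    counts = [scale * k for k in base]
    n = sum(counts)
    p = [k / n for k in counts]
    per_sample = log_prob_histogram(counts, q) / n
    gaps.append(abs(per_sample + relative_entropy(p, q)))
    print(n, per_sample, -relative_entropy(p, q))
```

The gap between the two printed columns shrinks as N grows, which is the sense in which (6.9) holds: the terms neglected by Stirling's formula grow only logarithmically in N.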


6.2  MRE  by  alternate  minimization  and  the  EM  algorithm 

In this section, an interesting interpretation of the EM algorithm is provided, using the
MRE  criterion.  This  interpretation  is  based  on  the  fact  that  the  MRE  criterion,  used  in 
a  special  way,  reduces  to  the  ML  or  the  MAP  criterion.  Minimizing  the  MRE  criterion  is 
usually  difficult;  thus,  the  iterative  alternate  minimization  (or  coordinate  search)  algorithm 
may  be  suggested.  In  the  special  case  where  the  MRE  criterion  reduces  to  the  ML  criterion, 
this  alternate  minimization  algorithm  reduces  to  the  EM  algorithm,  where  minimizing  with 
respect  to  one  density  is  equivalent  to  the  E  step,  and  minimizing  with  respect  to  the  other 
density  is  equivalent  to  the  M  step. 

As already mentioned, the minimization of the relative entropy by alternate minimization
and its relation to the maximum likelihood criterion were originally suggested in [7]
and [8]. The alternate minimization method and its properties were also developed in
[73], where an explicit relation to the EM algorithm was established.

6.2.1 MRE by alternate minimization

As discussed above, the goal of any statistical inference process is to find a probability
distribution that explains the observed phenomenon under some a-priori knowledge and
constraints. Suppose that the observations and the a-priori knowledge may be summarized
into constraints on the desired probability density functions, i.e. that $p(x) \in P$ and $q(x) \in Q$,
where P, Q are sets of p.d.f.'s. A version of the MRE criterion will solve our inference problem
via the following functional minimization:

$$\hat{p}(x), \hat{q}(x) = \arg\min_{p(x) \in P,\; q(x) \in Q} H(p(x); q(x)) = \arg\min_{p(x) \in P,\; q(x) \in Q} \int_X p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (6.11)$$

The minimization of the relative entropy in (6.11) is complicated in general. It is also a
functional optimization problem; thus, numerical methods cannot be easily applied either.
To solve this problem, an alternate minimization method (or coordinate search algorithm) is
suggested. In this method, we generate a sequence of solutions $\{p^{(n)}(x), q^{(n)}(x)\}$ as follows:

• Start: guess $p^{(0)}(x)$, $q^{(0)}(x)$

• Iterate, until some convergence criterion is achieved,

$$p^{(n+1)}(x) = \arg\min_{p \in P} H(p(x); q^{(n)}(x)) \qquad (6.12)$$

$$q^{(n+1)}(x) = \arg\min_{q \in Q} H(p^{(n+1)}(x); q(x)) \qquad (6.13)$$

Any alternate minimization algorithm has the desired monotonicity property. Thus,
each iteration improves (in our case decreases) the goal function. Specifically,

$$H(p^{(n+1)}(x); q^{(n+1)}(x)) \le H(p^{(n+1)}(x); q^{(n)}(x)) \le H(p^{(n)}(x); q^{(n)}(x)) \qquad (6.14)$$

The monotonicity property implies that, if the goal function is bounded, then it also
converges to some value $H^*$. A continuous function, like the relative entropy, will
be bounded when it is defined over a compact set. For our iterative algorithm, in order to
show that $H(p^{(n)}; q^{(n)}) \to H^*$, it is sufficient to show that the set $P_0 \times Q_0$, where

$$P_0 \times Q_0 = \{\, (p \in P,\; q \in Q) : H(p;q) \le H(p^{(0)}; q^{(0)}) \,\} \qquad (6.15)$$

is compact.

Differentiability of the goal function will imply that $H^*$ is a stationary value. If the sets
P and Q are convex, then the convergence point $H^*$ is a global minimum. Unfortunately,
in the special case where the desired density functions are defined parametrically, these
sets are rarely convex, even when the possible set in the parameter space is convex.

A comprehensive discussion on the properties of this algorithm, especially the convergence
issues, may be found in [8], pp. 107-133, and in [73].
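
The iteration (6.12)-(6.13) and the monotonicity property (6.14) can be sketched on a small discrete problem. The two sets below are hypothetical choices made only for illustration: P contains all distributions on {0, 1, 2, 3} with p(0) = c, and Q is the Binomial(3, θ) family. With these choices both partial minimizations have closed forms: the p-step rescales q on {0} and on its complement, and the q-step matches the mean of p.

```python
import math

M = 3  # the sample space is {0, 1, 2, 3}

def binomial_pmf(theta):
    return [math.comb(M, x) * theta**x * (1 - theta)**(M - x) for x in range(M + 1)]

def relative_entropy(p, q):
    """H(p; q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def p_step(q, c):
    """(6.12): arg min over P = {p : p(0) = c} of H(p; q)."""
    rest = 1.0 - q[0]
    return [c] + [(1.0 - c) * qx / rest for qx in q[1:]]

def q_step(p):
    """(6.13): arg min over Q = {Binomial(M, theta)} of H(p; q), by mean matching."""
    theta = sum(x * px for x, px in enumerate(p)) / M
    return binomial_pmf(theta)

def alternate_minimization(c=0.4, theta0=0.9, n_iter=30):
    q = binomial_pmf(theta0)
    values = []  # goal function after every half-step
    for _ in range(n_iter):
        p = p_step(q, c)
        values.append(relative_entropy(p, q))
        q = q_step(p)
        values.append(relative_entropy(p, q))
    return p, q, values

p, q, values = alternate_minimization()
print(values[:4], values[-1], q[0])
```

In this particular example the two sets intersect (there is a binomial distribution with q(0) = 0.4), so the goal function decreases to zero. In general the limit is only guaranteed to be a stationary value, as discussed above.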

6.2.2  ML  as  a  special  case  of  the  MRE  criterion 

Let X be a sample space, referred to as the complete data sample space, and suppose
that a parametric family of probability distributions is defined over it. This parametric
family is indexed by the vector $\theta \in \Theta$, where $\Theta$ is a subset of the k-dimensional Euclidean
space. Thus, the set of possible complete data densities is

$$Q_\Theta = \{\, q(x;\theta) : \theta \in \Theta \,\} \qquad (6.16)$$

Suppose that we do not observe $x \in X$, but instead we observe an incomplete data $y = T(x)$,
where T(·) is a many-to-one mapping. The new sample space, given the incomplete
data observations, is X(y) as in (2.4). The possible p.d.f.'s, given y, are therefore constrained
to the set

$$P_y = \left\{\, p(\cdot) : p(x) = 0 \;\; \forall x \notin X(y), \;\; \int_{X(y)} p(x)\, dx = 1 \,\right\} \qquad (6.17)$$
Note that a sample space, Y, for the observation also exists, and each $\theta$ defines a p.d.f. over
this space, given by

$$f_Y(y;\theta) = \int_{X(y)} q(x;\theta)\, dx \qquad (6.18)$$

The probability densities, $q(\cdot) \in Q_\Theta$ and $p(\cdot) \in P_y$, are estimated via the minimum
relative entropy criterion. Thus, we have to solve

$$\hat{p}(x), \hat{q}(x) = \arg\min_{p(x) \in P_y,\; q(x) \in Q_\Theta} \int_X p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (6.19)$$

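Note that estimating $q(\cdot) \in Q_\Theta$ is equivalent to estimating $\theta \in \Theta$, i.e. we have to solve

$$\hat{p}(x), \hat{\theta}_{MRE} = \arg\min_{p(x) \in P_y,\; \theta \in \Theta} \int_X p(x) \log \frac{p(x)}{q(x;\theta)}\, dx \qquad (6.20)$$

where $\hat{\theta}_{MRE}$ denotes the minimum relative entropy estimator of the parameters.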

The ML criterion determines the estimate $\theta \in \Theta$ by maximizing the likelihood of the
observation, i.e.

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} f_Y(y;\theta) = \arg\max_{\theta \in \Theta} \int_{X(y)} q(x;\theta)\, dx \qquad (6.21)$$

We will now show that estimating $\theta$ by the minimum relative entropy (i.e. by (6.20)) is
equivalent to estimating $\theta$ by maximum likelihood (i.e. by (6.21)).

For any fixed $\theta$ we will minimize (6.20) with respect to p(x). This minimization problem
may  be  solved  explicitly  using  the  following  lemma,  which  is  a  direct  result  of  the  convexity 
of  the  relative  entropy  function. 


Lemma 6.1 For any measurable set A

$$\int_A p(x) \log \frac{p(x)}{q(x)}\, dx \;\ge\; P(A) \log \frac{P(A)}{Q(A)} \qquad (6.22)$$

where $P(A) = \int_A p(x)\, dx$ and $Q(A) = \int_A q(x)\, dx$. Equality holds if and only if

$$\frac{p(x)}{P(A)} = \frac{q(x)}{Q(A)} \quad \text{a.e. in } A \qquad (6.23)$$

The proof of this lemma may be found in [8], page 373.

Let A be the set X(y). In this case, $P(X(y)) = 1$ and $Q(X(y)) = f_Y(y;\theta)$. By the
lemma above, a probability density function that satisfies (6.23) will minimize the
relative entropy. Thus the p(x) that minimizes (6.20) for any fixed $\theta$ is

$$\hat{p}(x) = \frac{q(x;\theta)}{f_Y(y;\theta)} = f_{X/Y}(x/y;\theta), \qquad x \in X(y) \qquad (6.24)$$

which is the conditional probability of x given y. The value of the relative entropy at this
point is given again by the lemma above (6.22), i.e.

$$P(X(y)) \log \frac{P(X(y))}{Q(X(y))} = -\log f_Y(y;\theta) \qquad (6.25)$$

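The relations above and the equivalence of the minimum relative entropy estimator,
$\hat{\theta}_{MRE}$, and the maximum likelihood estimator, $\hat{\theta}_{ML}$, are summarized in the following
equation:

$$\hat{\theta}_{MRE} = \arg\min_{\theta \in \Theta} \left[ \min_{p(x) \in P_y} \int_{X(y)} p(x) \log \frac{p(x)}{q(x;\theta)}\, dx \right] = \arg\min_{\theta \in \Theta} \left[ -\log f_Y(y;\theta) \right] = \hat{\theta}_{ML} \qquad (6.26)$$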
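
A discrete analogue of (6.24) and (6.25) can be verified directly. In the sketch below the complete-data model `q` on {0, ..., 4} and the set `x_of_y` (playing the role of X(y)) are hypothetical values chosen for illustration. The code checks that the conditional distribution attains the relative entropy $-\log f_Y(y;\theta)$ and that another distribution supported on X(y) does worse, as Lemma 6.1 predicts.

```python
import math

def relative_entropy(p, q):
    """H(p; q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def conditional_on(q, subset):
    """Discrete form of (6.24): q restricted to X(y) and renormalized.

    Also returns the mass of X(y) under q, the analogue of f_Y(y; theta).
    """
    mass = sum(q[x] for x in subset)
    p = [q[x] / mass if x in subset else 0.0 for x in range(len(q))]
    return p, mass

# Hypothetical complete-data model; the observation y only reveals
# that x lies in X(y) = {1, 2, 4}.
q = [0.10, 0.25, 0.15, 0.30, 0.20]
x_of_y = {1, 2, 4}

p_hat, f_y = conditional_on(q, x_of_y)
p_other = [0.0, 0.5, 0.3, 0.0, 0.2]  # another member of P_y, for comparison

print(relative_entropy(p_hat, q), -math.log(f_y))
print(relative_entropy(p_other, q))
```

Since the minimized relative entropy equals $-\log f_Y(y;\theta)$ for every θ, minimizing it further over θ is the same as maximizing the likelihood, which is the content of (6.26).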

6.2.3  The  EM  as  an  alternate  minimization  algorithm 

Since  the  MRE  criterion  reduces  to  the  ML  criterion  in  the  special  case  summarized 
above,  applying  the  alternate  minimization  algorithm  of  (6.12)  and  (6.13),  will  provide  an 
iterative  algorithm  for  maximum  likelihood.  This  iterative  algorithm  is  the  EM  algorithm, 
as  shown  below. 

In each step of the alternate minimization algorithm, we have the current estimates,
$p^{(n)}(x)$ and $q^{(n)}(x) = q(x;\theta^{(n)})$, of the p.d.f.’s. Applying first (6.12), we get

$$p^{(n+1)}(x) = \arg\min_{p(x) \in P_y} H(p(x); q(x;\theta^{(n)})) = f_{X/Y}(x/y;\theta^{(n)}) \qquad (6.27)$$

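where this explicit solution is based, again, on Lemma 6.1. Now applying the second step,
i.e. (6.13), we get

$$\theta^{(n+1)} = \arg\min_{\theta \in \Theta} H(p^{(n+1)}(x); q(x;\theta)) = \arg\min_{\theta \in \Theta} \int_{X(y)} f_{X/Y}(x/y;\theta^{(n)}) \log \frac{f_{X/Y}(x/y;\theta^{(n)})}{q(x;\theta)}\, dx \qquad (6.28)$$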

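Using the notations of (2.11), we may write

$$\theta^{(n+1)} = \arg\min_{\theta \in \Theta} \left[ H(\theta^{(n)}, \theta^{(n)}) - Q(\theta, \theta^{(n)}) \right] = \arg\max_{\theta \in \Theta} Q(\theta, \theta^{(n)}) \qquad (6.29)$$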


Equation (6.29) is exactly an iteration of the EM algorithm!
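
The two half-steps (6.27) and (6.28)-(6.29) can be traced on a small incomplete-data problem. The sketch below uses the well-known multinomial example of Dempster, Laird and Rubin, which is not part of this chapter and is chosen here only for illustration: four observed cells with probabilities (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4) and counts (125, 18, 20, 34), where the first observed cell is the union of two complete-data cells with probabilities 1/2 and θ/4.

```python
import math

def neg_log_likelihood(y, theta):
    """-log f_Y(y; theta), up to an additive constant, for the cell probabilities
    (1/2 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4)."""
    y1, y2, y3, y4 = y
    return -(y1 * math.log(0.5 + theta / 4)
             + (y2 + y3) * math.log((1 - theta) / 4)
             + y4 * math.log(theta / 4))

def em(y, theta=0.5, n_iter=30):
    y1, y2, y3, y4 = y
    trace = [neg_log_likelihood(y, theta)]
    for _ in range(n_iter):
        # p-step (6.27), i.e. the E step: split the count of the first observed
        # cell according to the conditional distribution of the complete data.
        x12 = y1 * (theta / 4) / (0.5 + theta / 4)
        # theta-step (6.28)-(6.29), i.e. the M step: maximize Q(theta, theta_n),
        # which is a binomial-type likelihood in theta for the complete data.
        theta = (x12 + y4) / (x12 + y2 + y3 + y4)
        trace.append(neg_log_likelihood(y, theta))
    return theta, trace

theta_hat, trace = em((125, 18, 20, 34))
print(theta_hat, trace[0], trace[-1])
```

By (6.25), the recorded values of $-\log f_Y(y;\theta^{(n)})$ are the values of the relative entropy after each p-step, so the monotonicity (6.14) appears here as a non-increasing `trace`.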


6.2.4  Remark:  how  not  to  use  the  MRE  criterion 

The relative entropy is sometimes interpreted as a distance measure between two
probability distributions: the “Kullback-Leibler” measure. However, it lacks one of the desired
features of a distance measure, namely, it is not symmetric,

$$H(p;q) \neq H(q;p) \qquad (6.30)$$

Furthermore, following the common rationale of the MRE method, the relative entropy has
meaning only for comparing possible probability measures, $p$, given an a-priori assignment,
$q$. Thus, we prefer to interpret the relative entropy as representing the conditional likelihood
of an assignment $p$, given the assignment $q$, as summarized in equation (6.9), i.e.

$$H(p;q) \sim -\log \mathrm{Prob}(p/q) \qquad (6.31)$$

Having this interpretation, it makes sense to minimize the relative entropy with respect
to the first argument, $p$, only. Unfortunately, the alternative derivation of the EM algorithm
is based on minimizing the relative entropy with respect to both $p$ and $q$. Thus, with
respect to this derivation, we agree that for any given probability assignment, $f_X(x)$, for
the complete data X, the best assignment over the set X(y) is the conditional density,
$f_{X/Y}(x/y)$. However, minimizing the relative entropy with respect to $f_X(x)$ as well, in this
case, is not justified.

This  remark  does  not  detract  from  the  mathematical  elegance  and  the  additional  insight 
that  may  be  gained  by  the  alternative  derivation  of  the  EM  algorithm,  but  it  does  suggest 
that  justifying  the  maximum  likelihood  criterion  via  this  relation  is  a  poor  use  of  the  MRE 
criterion.


6.3 Minimum Description Length interpretation of the MRE criterion

Despite the criticism, the MRE criterion is used and justified in several statistical
problems. It can estimate an entire probability distribution function. It is also the basis of the
alternative  derivation  of  the  EM  algorithm.  Thus  an  additional  interpretation  of  the  MRE 
method  is  desirable. 

After  a  brief  review  of  the  philosophical  idea  of  the  Minimum  Description  Length  (MDL) 
and  the  Minimum  Information  (MI)  criteria,  we  will  prove  the  main  result  of  this  chapter, 
namely,  that  the  MRE  criterion  is  a  special  case  of  the  MDL  criterion,  in  a  certain  context. 
On  one  hand,  this  result  clarifies  the  appropriate  context  in  which  the  MRE  should  be  used. 
On  the  other  hand  it  motivates  and  supports  the  MDL  criterion  by  showing  that,  in  the 
appropriate  special  case,  it  reduces  to  another  proven  criterion. 

The idea of complete and incomplete data specifications that is so important in the
development of the EM algorithm also plays an important role in the definition and the
proof of this relationship between the MDL and MRE criteria. To show the relation
between the MDL and the MRE criteria, the MDL criterion is used in a mode where it
considers a set of possible observations. An immediate example of such a situation is the
set X(y) of the possible complete data, given an observation y, which is used in the EM
algorithm context.

6.3.1  The  Minimum  Description  Length  idea 

We have already mentioned, in Chapter 2, the Minimum Description Length criterion,
suggested by Rissanen [29,30,31] and the more general Minimum Information criterion,
suggested originally by Solomonoff [21] and recently by Hart [22]. The philosophical foundation
of the MDL/MI methods is the claim that the most compact description of the
observation provides the best explanation of the phenomena we observe. In other words, if we
have the best method to encode or compress the observation, we have actually estimated
the probability distribution that “explains” the data best. This philosophy is intuitively
reasonable and is consistent with universal philosophical principles, such as the Ockham's
Razor principle. We strongly believe in these “principles of parsimony”, and, since these
principles can be made precise by the quantitative measures of information and complexity,
we strongly advocate their use.

Using these criteria, we may overcome some of the limitations of the ML method and
generate methods that can accept less restrictive modeling assumptions. For example, the
ML method fails when the number of parameters is unknown, since the more parameters
we choose, the larger the likelihood can be. In this case, a specific application of the MDL
criterion tries to estimate the parameters together with their number by (see also (2.73)
and [31]),

$$\hat{n}, \hat{\theta} = \arg\min_{n,\, \theta} \left[ -\log p(x;\theta) + \frac{n}{2} \log N \right] \qquad (6.32)$$

where N is the length of the observation sequence.

In the MDL method, the abstract principle of shortest description is translated into
a mathematical criterion in the following way. The description (or code) length above
is influenced by two factors. If we knew the probability distribution, the “ideal” code
length [76] that is required to represent the specific observation is the (self) information
of the observation, i.e. $-\log p(x)$. The second factor, $\frac{n}{2} \log N$, is the code length needed
to represent the model or the parameters, considering the precision required for encoding
continuous parameters.
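
The trade-off between the two terms of (6.32) can be illustrated with the simplest possible model selection problem. The sketch below is an assumed toy setting, not an example from the thesis: a binary sequence of length N is described either by a fair coin (no free parameter) or by a Bernoulli model with θ estimated by maximum likelihood (one free parameter, costing (1/2) log N). Code lengths are measured in nats.

```python
import math

def code_length_fair(n_ones, n):
    """Model with no free parameters: theta is fixed at 1/2."""
    return n * math.log(2.0)

def code_length_bernoulli(n_ones, n):
    """One free parameter: -log p(x; theta_ML) + (1/2) log N, as in (6.32)."""
    theta = n_ones / n
    data_term = 0.0
    for count, prob in ((n_ones, theta), (n - n_ones, 1.0 - theta)):
        if count > 0:
            data_term -= count * math.log(prob)
    return data_term + 0.5 * math.log(n)

def mdl_choice(n_ones, n):
    """Return the model with the shorter total description length."""
    fair = code_length_fair(n_ones, n)
    fitted = code_length_bernoulli(n_ones, n)
    if fair <= fitted:
        return "fair coin", fair
    return "Bernoulli(theta)", fitted

print(mdl_choice(52, 100))
print(mdl_choice(80, 100))
```

With 52 ones out of 100 the likelihood gain of the fitted model does not pay for the extra parameter, whereas with 80 ones it clearly does. Maximum likelihood alone would always prefer the fitted model.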

To show that the ME and MRE methods are special cases of the MDL criterion, we have
to extend the MDL method somewhat. Suppose that the information given about the
underlying phenomena is not a single observation sequence, but rather a set of such sequences.
This type of information is available either by having several independent observation
sequences or by having constraints that define a possible set of observation sequences. The
MDL criterion for this type of information will suggest that we choose the probability
distribution that minimizes the weighted combination of all code lengths, by some a-priori
weight q(x), where all members of this set of possible observation sequences are encoded
using the proposed distribution.

We can now adapt the MDL criterion to the MRE framework, in which the given
information about the underlying phenomena is in terms of constraints on the probability
distribution. We claim that if we try to represent all possible observation sequences, whose
“histogram” or sample frequencies satisfy those constraints, the minimum weighted code
length is achieved by the MRE probability distribution.

6.3.2  Minimum  Relative  Entropy  as  Minimum  Description  Length 

In the MRE framework, the given information is the knowledge of some averages. Now
recall that the strong law of large numbers implies,

$$E[g(x)] = \bar{g} \iff \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g(x_i) = \bar{g} \quad \text{a.s.}, \qquad (6.33)$$

where the $x_i$ are i.i.d. observations, distributed as x. So, this MRE-type information is
equivalent to the information that the observations lie in the set of all infinitely long
sequences whose sample averages equal the given averages.

Considering the above argument, we will show that minimizing the relative entropy,
subject to some constraints on the probability distribution, is asymptotically equivalent to
minimizing the weighted combined code length needed to represent all the sequences whose
“histogram” or sample frequencies satisfy the given constraints.

To clarify our argument, let us start with a simple example. Suppose we want to estimate
the probability of “1” (success) in a simple binary (Bernoulli) trial. We denote $p(\text{“1”}) = \theta$
and $p(\text{“0”}) = 1 - \theta$. Suppose we do not have any observations, so we only know that for
any N trials that we will perform, we may observe any of the $2^N$ possible sequences of “1”s
and “0”s.

Equipped with the Minimum Description Length philosophy, and applying a uniform
weight to all code lengths, we will choose $\theta$ so that all $2^N$ sequences can be represented by
the shortest possible code. Now, for each sequence x we need about

$$-\log p(x;\theta) = -\log\left[ \theta^{k} (1-\theta)^{N-k} \right] \qquad (6.34)$$

code units, where k denotes the number of “1”s in x. Summed with equal weight over all $2^N$
sequences, this code length is minimized, as expected, by $\theta = 1/2$. Observe that this
probability function is the same as that given by the Maximum Entropy principle (or the MRE
principle with uniform prior) with no constraints on the binary random variable.

Notice that here we have ignored the term $\frac{1}{2} \log N$, required to represent the code
length for describing the single parameter $\theta$, because it has no effect on the minimizing
value.
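
The Bernoulli example can be reproduced by brute force for a small N. The sketch below enumerates all $2^N$ binary sequences, sums the code lengths (6.34) with uniform weight, and searches a grid of θ values for the minimizer. The choice N = 8 and the grid resolution are arbitrary illustrative settings.

```python
import math
from itertools import product

def total_code_length(theta, n):
    """Sum over all 2^n binary sequences of -log p(x; theta), with uniform weights."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        k = sum(x)  # number of "1"s in the sequence
        total -= k * math.log(theta) + (n - k) * math.log(1.0 - theta)
    return total

n = 8
grid = [i / 100 for i in range(1, 100)]
best_theta = min(grid, key=lambda t: total_code_length(t, n))
print(best_theta, total_code_length(best_theta, n))
```

Because every sequence and its complement appear together in the sum, the total equals $-2^N \frac{N}{2} [\log\theta + \log(1-\theta)]$, which is symmetric in θ and 1 − θ and is minimized at θ = 1/2.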

We  are  ready  now  to  prove  the  general  claim,  stated  as  the  following  theorem. 


Theorem 6.1 Let X be a random variable that takes its values over the finite set $\{1, \ldots, m\}$.
Let $x = x_1 x_2 \cdots x_N$ be a sample of N independent trials of X. Let $f_i(x) = k_i(x)/N$ be the
frequencies of each outcome in this sample. The vector $f(x) = [f_1(x), \ldots, f_m(x)]^T$ will be
called the histogram of the sample. Let $\mathcal{F}$ be any fixed set of histograms. Let $X_N$ be the set

$$X_N = \{\, x = x_1 \cdots x_N \;|\; f(x) \in \mathcal{F} \,\}.$$

Let the weighted code length, with the weights $q = [q_1, \ldots, q_m]$, that results when the whole set
$X_N$ is encoded by a code book designed according to $p = [p_1, \ldots, p_m]$, be denoted $L(X_N, p, q)$,
where

$$L(X_N, p, q) = \sum_{x \in X_N} q(x) \left[ -\log p(x) \right]$$

Then the probability assignment $\hat{p}_N = [\hat{p}_{N,1}, \ldots, \hat{p}_{N,m}]$ that minimizes $L(X_N, p, q)$ is,

$$\hat{p}_{N,i} = \frac{\sum_{f \in \mathcal{F}} f_i \; q_1^{k_1} \cdots q_m^{k_m} \frac{N!}{k_1! \cdots k_m!}}{\sum_{f \in \mathcal{F}} q_1^{k_1} \cdots q_m^{k_m} \frac{N!}{k_1! \cdots k_m!}} \qquad (6.37)$$

Furthermore, as $N \to \infty$

$$\hat{p} = \lim_{N \to \infty} \hat{p}_N = \arg\min_{f \in \mathcal{F}} H(f;q) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{m} f_i \log \frac{f_i}{q_i} \qquad (6.38)$$

Proof: The code length required to represent a sample, $x = x_1 \cdots x_N$, in a code book
designed by $p$ (within the term $\frac{m}{2} \log N$ required to encode the probabilities $p_i$) is,

$$l(x) = -\log p(x) = -\log\left( \prod_{i=1}^{m} p_i^{k_i} \right) = -\sum_{i=1}^{m} k_i \log p_i \qquad (6.39)$$

where $k_i$ is the number of occurrences of the i-th outcome in x. The weighted code length is
given by

$$L(X_N, p, q) = \sum_{x \in X_N} q_1^{k_1} \cdots q_m^{k_m} \left( -\sum_{i=1}^{m} k_i \log p_i \right) \qquad (6.40)$$


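Now, there are $\frac{N!}{k_1! \cdots k_m!}$ possible sequences having the same frequencies or the same
number of occurrences, $k = [k_1, \ldots, k_m]^T$. Since the constraints are only on the frequencies
(or on k), we can write the weighted code length as,

$$L(X_N, p, q) = \sum_{\text{admissible } k} \frac{N!}{k_1! \cdots k_m!} \, q_1^{k_1} \cdots q_m^{k_m} \left( -\sum_{i=1}^{m} k_i \log p_i \right) = -\sum_{i=1}^{m} \left( \sum_{f \in \mathcal{F}} k_i \, \frac{N!}{k_1! \cdots k_m!} \, q_1^{k_1} \cdots q_m^{k_m} \right) \log p_i \qquad (6.41)$$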


We will denote

$$a_i = \sum_{f \in \mathcal{F}} k_i \, \frac{N!}{k_1! \cdots k_m!} \, q_1^{k_1} \cdots q_m^{k_m} \qquad (6.42)$$

The weighted code length is thus

$$L(X_N, p, q) = -\sum_{i=1}^{m} a_i \log p_i \qquad (6.43)$$

which is minimized (using Jensen’s inequality) by

$$\hat{p}_{N,i} = \frac{a_i}{\sum_{j=1}^{m} a_j} \qquad (6.44)$$

Substituting (6.42) in (6.44) and recalling that $k_i / \sum_{j=1}^{m} k_j = f_i$, yields (6.37).

In general, we will get the MRE distribution only in the limit as $N \to \infty$, as follows.
Following the derivation of (6.7) and (6.8), we get

$$q_1^{k_1} \cdots q_m^{k_m} \, \frac{N!}{k_1! \cdots k_m!} \approx e^{-N H(f;q)} \qquad (6.45)$$


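where $H(f;q) = \sum_{i=1}^{m} f_i \log \frac{f_i}{q_i}$ is the relative entropy between the frequencies $f_i = k_i/N$
and $q_i$. Substituting (6.45) in (6.37) and taking the limit as $N \to \infty$ yields

$$\hat{p} = \lim_{N \to \infty} \frac{\sum_{f \in \mathcal{F}} f \, e^{-N H(f;q)}}{\sum_{f \in \mathcal{F}} e^{-N H(f;q)}} \qquad (6.46)$$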

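Let us assume that the function $H(f;q)$ has in $\mathcal{F}$ a single global minimum, at $f^*$. Now
as $N \to \infty$

$$\lim_{N \to \infty} e^{-N [H(f;q) - H(f^*;q)]} = \begin{cases} 0 & \text{if } f \neq f^* \\ 1 & \text{if } f = f^* \end{cases} \qquad (6.47)$$

so, we can write (6.46)

$$\hat{p} = \lim_{N \to \infty} \frac{\sum_{f \in \mathcal{F}} f \, e^{-N H(f;q)}}{\sum_{f \in \mathcal{F}} e^{-N H(f;q)}} = \lim_{N \to \infty} \frac{\sum_{f \in \mathcal{F}} f \, e^{-N [H(f;q) - H(f^*;q)]}}{\sum_{f \in \mathcal{F}} e^{-N [H(f;q) - H(f^*;q)]}} = f^* \qquad (6.48)$$

i.e. $\hat{p}$ is the MRE distribution. □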
Note that in general, if $H(\cdot;q)$ has several global minima, say n of them, at $f^*_1, \ldots, f^*_n$,
then the result in (6.46) will be

$$\hat{p} = \frac{1}{n} \sum_{t=1}^{n} f^*_t$$

However, since $H(\cdot;q)$ is convex, whenever the constraint set is convex, there is only a single
minimum.

We  claim  that  the  above  theorem  can  be  extended,  following  the  same  lines  of  proof, 
to  the  case  where  the  random  variable  takes  its  values  over  an  infinite  set.  The  relation 
between  the  MDL  and  the  MRE  criteria  is  thus  fully  established. 
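
Theorem 6.1 can be examined numerically for small m. The sketch below evaluates (6.37) exactly for m = 3 by enumerating all count vectors, and compares the result with the MRE distribution as N grows. Everything specific in it is an assumption made for illustration: the prior `q`, the outcome values `g`, the convex constraint set (histograms whose mean is at least 2.2), and the use of an exponentially tilted prior found by bisection to obtain the MRE solution on the boundary of that set.

```python
import math

def tilt(q, g, lam):
    """Exponentially tilted prior p_i ~ q_i * exp(lam * g_i)."""
    exponents = [lam * gi for gi in g]
    top = max(exponents)  # subtracted for numerical stability
    w = [qi * math.exp(e - top) for qi, e in zip(q, exponents)]
    total = sum(w)
    return [wi / total for wi in w]

def mre_on_boundary(q, g, g_bar):
    """MRE distribution under sum_i p_i g_i = g_bar, by bisection on lam."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = tilt(q, g, mid)
        if sum(pi * gi for pi, gi in zip(p, g)) < g_bar:
            lo = mid
        else:
            hi = mid
    return tilt(q, g, 0.5 * (lo + hi))

def mdl_assignment(q, n, admissible):
    """Evaluate (6.37) for m = 3 by enumerating all count vectors (k1, k2, k3)."""
    log_q = [math.log(qi) for qi in q]
    terms = []
    for k1 in range(n + 1):
        for k2 in range(n + 1 - k1):
            k = (k1, k2, n - k1 - k2)
            if admissible(k, n):
                # log of q_1^k1 q_2^k2 q_3^k3 * N! / (k1! k2! k3!)
                log_w = (math.lgamma(n + 1)
                         - sum(math.lgamma(ki + 1) for ki in k)
                         + sum(ki * lq for ki, lq in zip(k, log_q)))
                terms.append((log_w, k))
    top = max(log_w for log_w, _ in terms)
    num, den = [0.0, 0.0, 0.0], 0.0
    for log_w, k in terms:
        w = math.exp(log_w - top)
        den += w
        for i in range(3):
            num[i] += w * k[i] / n
    return [x / den for x in num]

q = [0.5, 0.3, 0.2]   # a-priori weights (illustrative); prior mean is 1.7
g = [1, 2, 3]         # values attached to the three outcomes

def admissible(k, n):
    # Constraint set: histograms with mean >= 2.2, tested in integer arithmetic
    return 10 * sum(ki * gi for ki, gi in zip(k, g)) >= 22 * n

p_star = mre_on_boundary(q, g, 2.2)
errors = {}
for n in (50, 100, 200, 400):
    p_n = mdl_assignment(q, n, admissible)
    errors[n] = max(abs(a - b) for a, b in zip(p_n, p_star))
    print(n, [round(x, 4) for x in p_n], errors[n])
```

For finite N the minimizer of the weighted code length is close to, but not identical with, the MRE distribution; the printed error decreases as N grows, consistent with (6.38).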
Chapter 7


Conclusions and further research


In this final chapter, we will summarize and discuss the results and the contributions of
the thesis and try to convey a general philosophy for solving signal processing problems that
may be established from this thesis. Before that, we will suggest further research directions.
A particularly interesting suggestion for further research, for which we have specific ideas, is
the application of the EM method to the problem of signal reconstruction from partial
information.


7.1  Further  research 

Many research directions may be suggested to complete and extend the work presented
in this thesis. Interesting theoretical problems as well as interesting signal processing
applications may be explored further. In this section, we will indicate a few of these research
projects. We will start by discussing an important subject that has not been explored
enough in the thesis, namely, the analysis and the applications of the sequential and
adaptive EM algorithms.

We have developed specific ideas and mathematical formulations for solving problems
of signal reconstruction from partial information, in the EM context. These ideas should be
investigated further and should be applied to real reconstruction problems. We will present
these ideas and the further proposed research in the next section.

7.1.1  The  sequential  and  adaptive  algorithms 

Considering  the  theory  of  the  sequential  EM  algorithm,  we  suggest  the  following  research 
directions: 

•  Non  asymptotic  statistical  analysis 

We  have  considered  only  limit  distribution  and  consistency.  However,  interesting 
questions  arise  in  the  non-asymptotic  case.  A  complete  analysis  of,  say,  the  variance 
of  the  estimator  as  the  iteration  proceeds  is  desirable. 

•  Rate  of  convergence  and  tracking 

A  topic  that  is  close,  but  not  identical,  to  the  previous  topic,  is  the  convergence  rate  of 
the  sequential  and  adaptive  algorithms.  For  the  adaptive  algorithms  we  are  interested 
in the tracking capabilities, which improve if the algorithm converges faster.

•  Limit  distribution  for  non  i.i.d.  case 

This  research  topic  will  complete  the  results  presented  in  Chapter  3. 

•  Other  approximations 

We  have  suggested,  in  Chapter  3,  sequential  EM  algorithms,  based  on  a  specific 
approximation  of  the  batch  EM  algorithm.  However,  there  may  also  be  other  possible 
approximations  that  should  be  investigated. 

The sequential and adaptive algorithms developed in the thesis have not been applied
yet,  in  a  serious  way,  to  signal  processing  problems.  As  a  first  step,  we  suggest  that  these 
algorithms  be  applied  extensively  to  the  problems  that  have  been  solved  in  Chapters  4  and  5 
of  the  thesis,  via  the  batch  EM  algorithm.  However,  other  signal  processing  problems  also 
call  for  adaptive  solutions. 

We  may  want  to  start  with  a  simpler  example.  Consider  the  problem  of  a  stationary 
signal  in  a  stationary  noise,  where  the  suggested  batch  EM  algorithm  is  the  iterative  Wiener 
filter  algorithm.  Assume  that,  given  the  signal  without  the  noise,  the  parameters  may  be 
estimated  sequentially,  say,  by  the  RLS  algorithm.  It  will  be  interesting  to  try  a  recursive 
algorithm  that  uses  a  Kalman  filter  (instead  of  a  Wiener  filter)  to  get  the  signal,  and  then 
use  this  signal  recursively,  to  estimate  the  parameters. 
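The recursive scheme proposed in the preceding paragraph can be sketched as follows. This is a minimal illustration only, not an algorithm developed in the thesis: we assume a scalar AR(1) signal with known driving-noise variance `q` and known measurement-noise variance `r`, and the names `kalman_rls_ar1` and `simulate_ar1` are hypothetical. At each sample, a Kalman filter that uses the current coefficient estimate produces a signal estimate, and a scalar RLS recursion then updates the coefficient from consecutive signal estimates.

```python
import random


def simulate_ar1(a, q, r, n, seed=0):
    """Generate n noisy samples y[t] = s[t] + v[t] of s[t] = a*s[t-1] + u[t]."""
    rng = random.Random(seed)
    s, y = 0.0, []
    for _ in range(n):
        s = a * s + rng.gauss(0.0, q ** 0.5)
        y.append(s + rng.gauss(0.0, r ** 0.5))
    return y


def kalman_rls_ar1(y, q, r, forgetting=1.0):
    """Track the signal with a Kalman filter and its AR coefficient with RLS.

    Assumed model (illustrative): var(u) = q and var(v) = r are known,
    the AR coefficient is unknown and is estimated recursively.
    """
    a_hat = 0.0        # current estimate of the AR coefficient
    p_rls = 1e3        # RLS inverse-correlation term (large = weak prior)
    s_hat, p = 0.0, q  # Kalman state estimate and its error variance
    for obs in y:
        s_prev = s_hat
        # Kalman step, run with the current parameter estimate.
        s_pred = a_hat * s_prev
        p_pred = a_hat * a_hat * p + q
        gain = p_pred / (p_pred + r)
        s_hat = s_pred + gain * (obs - s_pred)
        p = (1.0 - gain) * p_pred
        # RLS step: regress the new signal estimate on the previous one.
        k = p_rls * s_prev / (forgetting + s_prev * s_prev * p_rls)
        a_hat += k * (s_hat - a_hat * s_prev)
        p_rls = (p_rls - k * s_prev * p_rls) / forgetting
    return a_hat, s_hat
```

For example, `kalman_rls_ar1(simulate_ar1(0.8, 1.0, 0.01, 5000), 1.0, 0.01)` returns a coefficient estimate together with the last filtered sample. We expect such a scheme to behave well when the measurement noise is small; for noisier data the coefficient estimate is generally biased, which is one of the questions the proposed research would have to address.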

Other  examples,  that  come  to  mind,  are  parameter  estimation  of  dynamic  systems,  the 
problem  of  tracking  the  trajectories  of  multiple  targets  using  tracking  radar,  analysis  of 
sequences  of  images  and  so  on. 

7.1.2  Other  research  directions 

Here,  we  will  briefly  present  other  research  directions  that  come  to  mind,  both  at  the 
theoretical level and the signal processing application level. We will start with the theoretical
research directions:

•  Global  Optimization: 

The EM algorithm can guarantee convergence only to a stationary point of the likelihood.
One might investigate the combination of the EM algorithm with standard
methods, summarized in the book [77]. Recently, a new technique, simulated
annealing [78], has become increasingly popular. The EM algorithm ideas may be
combined with this technique to achieve an algorithm for global optimization.

•  The  EEM  algorithm: 

The  idea  of  changing  the  complete  data,  presented  in  Chapter  2,  should  be  investigated 
further.  The  rules  suggested  for  varying  the  complete  data  could  be  stated  more 
concisely and their properties should be analyzed. This research direction may be
combined with the previous one, since one of the motivations for varying the complete
data  is  to  escape  from  unwanted  stationary  points. 

•  EM  algorithms  for  general  estimation  criteria: 

We  have  presented  in  the  thesis  the  EM  method  for  a  class  of  estimation  criteria. 
Further research may extend this class. The properties of this method for general
estimation criteria should be further explored.

•  Other  iterative  algorithms  for  ML: 

Recently, other iterative algorithms for ML estimation, given incomplete data, have
been suggested by statisticians, based on some ideas from the EM theory, e.g. [14].
These algorithms should be explored further; their sequential versions could be developed
and applied to signal processing problems.

Applications 

Many  further  signal  processing  applications  of  the  EM  method  can  be  considered.  We  will 
briefly  present  here  some  examples. 

•  Separating  a  narrow  band  signal  from  a  wide  band  signal 

This problem occurs in many practical situations, such as the problem of enhancing
the periodic acoustical signal of a helicopter in the wide band noise of a jet plane. This
problem  is  analogous  to  the  problem  of  filtering  a  stationary  signal  from  a  stationary 
noise,  solved  using  the  iterative  Wiener  filter.  However,  in  this  problem,  we  cannot 
model  the  narrow  band  signal  as  a  stationary  Gaussian  signal.  Modeling  this  problem 
correctly,  maybe  with  the  EM  algorithm  in  mind,  and  solving  the  resulting  statistical 
problem,  is  an  interesting  topic  for  further  research. 

•  Separating  two  speakers:  The  “Cocktail  Party”  problem: 

This  problem  is  analogous  to  the  previous  problem.  However,  we  do  not  expect  to 
gain  much  by  modeling  the  speech  signal  as  a  Gaussian  stationary  signal,  since  the 
spectral  distribution  of  both  speakers  is  identical.  We  suggest  that  by  modeling  the 
speech  signals  using  their  periodic  nature,  we  may  be  able  to  distinguish  the  speakers 
based  on  their  different  periodicities  and  phases.  The  resulting  statistical  problem 
may  be  complicated.  However,  it  might  be  solved  using  the  EM  method. 

•  Joint  estimation  of  pitch  and  spectral  parameters  of  speech: 

Usually the pitch and the LPC parameters of the speech signal are estimated independently.
Furthermore, the LPC parameters are estimated by modeling the speech signal
as  a  stationary  AR  process,  a  model  that  is  clearly  not  adequate  for  voiced  speech. 
We  suggest  using  the  pulse  excitation  model.  We  will  choose  the  complete  data  to 
be  the  pulse  train,  modeled  as  a  stochastic  point  process,  in  addition  to  the  observed 
speech  signal.  An  EM  algorithm  for  estimating  the  pitch  and  LPC  parameters  might 
be  suggested,  using  this  complete  data. 

Following the guidelines presented in section 2.6 and these examples, many more signal
processing applications could be suggested.


7.2  Signal reconstruction from partial information: Ideas and proposed research

In  this  section,  we  will  present  specific  ideas  and  the  mathematical  formulation  for 
solving  problems  of  signal  reconstruction  from  partial  information,  using  statistical  modeling 
and the EM algorithm.

7.2.1  General  discussion 

The  problem  of  signal  reconstruction  from  partial  information  has  been  investigated 
by many researchers, e.g. [79,80,81,82], to mention a few. Traditionally, a deterministic
formulation of this problem has been adopted, where the given partial information provided
(non-linear) deterministic equations and constraints for the unknown signal samples. A
major research effort was allocated to answer the questions of existence and uniqueness
of a solution, and, as a result, statements such as “Phase retrieval is impossible for a one
dimensional signal, however it is possible for a two dimensional signal” were declared. Other
research efforts led to algorithms which perform the reconstruction task by finding solutions
that satisfy the constraints and the equations, either directly, e.g. [83], or via iterative
procedures, e.g. [5], [84], [85].

This  deterministic  approach  assumes,  at  least  implicitly,  noiseless  measurements.  The 
effects  of  the  noise,  which  may  result  from  measurement  and  computation  errors,  were 
considered  only  by  investigating  the  robustness  of  the  algorithms  designed  for  the  noiseless 
case. It has been observed that some reconstruction problems, for which a solution exists,
have shown poor noise immunity, and thus are probably ill-posed and practically unsolvable
in this deterministic framework. For example, the reconstruction of a two dimensional
signal from its Fourier transform magnitude is an open problem, despite the fact that a
unique solution to this problem exists. In general, one can calculate condition numbers
for deterministic reconstruction algorithms and predict the chances of solving a specific
reconstruction problem in a real situation.

We will suggest a statistical formulation of the reconstruction problem. In this formulation,
the noise, whether measurement or computation noise, is modeled naturally. The signal
reconstruction problem becomes a statistical estimation problem, for which well known performance
bounds, like the Cramer-Rao bound, exist. These performance bounds play the
role  of  the  condition  numbers  in  the  deterministic  formulation.  However,  the  statistical 
performance  bounds  provide  more  information  and  insight. 
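
To illustrate how such a bound parallels a condition number, consider the following sketch. The linear model, the white Gaussian noise and the symbols $H$, $\sigma^2$, $J$ and $\mu_i$ are assumptions made only for this illustration; they do not represent the general reconstruction model.

```latex
% Assumed model: y = H s + v, v ~ N(0, \sigma^2 I), H of full column rank.
\begin{align*}
J(s) &= \frac{1}{\sigma^2} H^T H
  && \text{(Fisher information matrix)} \\
\mathrm{Cov}(\hat{s}) &\succeq J^{-1}(s) = \sigma^2 \left( H^T H \right)^{-1}
  && \text{(Cramer-Rao bound for unbiased } \hat{s}) \\
\frac{\max_i \sigma^2 / \mu_i}{\min_i \sigma^2 / \mu_i} &= \frac{\mu_{\max}}{\mu_{\min}} = \kappa(H)^2
  && (\mu_i \text{ the eigenvalues of } H^T H)
\end{align*}
```

In this sketch the spread of the bound over the eigen-directions reproduces the squared condition number $\kappa(H)^2$, while the bound itself also carries the absolute noise level $\sigma^2$ and the accuracy attainable in each direction. When $H$ is not of full column rank, $J$ is singular and no finite bound exists for the unidentifiable directions unless a-priori information is added.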

The  performance  of  a  reconstruction  problem  can  be  improved  by  incorporating  a-priori 
information  about  the  signal.  An  important  advantage  of  the  statistical  formulation  is  that 
a wide class of a-priori information may be easily incorporated. We note that in the deterministic
formulation, regularization methods were suggested in an attempt to improve the
performance. Some of the methods to regularize ill-posed deterministic problems are equivalent
to assigning simple a-priori probabilities to the signal in the stochastic formulation.
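
A standard instance of this equivalence may be written out as follows. The setting is our own illustrative assumption: a linear measurement model $y = Hx + v$ with white Gaussian noise of variance $\sigma_v^2$, and a zero-mean white Gaussian a-priori density for $x$ with variance $\sigma_x^2$. Under these assumptions the MAP estimate is

```latex
\begin{align*}
\hat{x}_{MAP} &= \arg\max_x \left[ \log f(y/x) + \log f(x) \right] \\
 &= \arg\min_x \left[ \frac{1}{2\sigma_v^2} \| y - Hx \|^2 + \frac{1}{2\sigma_x^2} \| x \|^2 \right] \\
 &= \arg\min_x \left[ \| y - Hx \|^2 + \mu \| x \|^2 \right],
 \qquad \mu = \frac{\sigma_v^2}{\sigma_x^2}
\end{align*}
```

which is the regularized least squares problem with a quadratic penalty. The regularization parameter $\mu$, usually chosen ad hoc in the deterministic formulation, is here the ratio of the noise variance to the a-priori signal variance.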

The statistical problems, in the proposed formulation, require maximizing the likelihood,
the a-posteriori probability or another statistical criterion, depending on the a-priori
information. These problems are naturally solved using the EM method. Since the observations
are partial and distorted, the complete data are the undistorted signals. For some ML
problems, where no special a-priori information is incorporated, the resulting EM algorithm
is equivalent to standard iterative algorithms such as the Gerchberg-Saxton algorithm [5], and
thus, unfortunately, has a similarly poor performance in ill-posed reconstruction problems.
However, we believe that open reconstruction problems may become well behaved by using
statistical models that incorporate enough realistic information. The EM method will
suggest a practical solution to the resulting statistical problems.

7.2.2  Statistical  formulation  of  signal  reconstruction  problems 

The first element in the suggested statistical formulation is a definition of the quantities
that should be estimated and inferred. These quantities may be the signal samples, the
samples of some underlying “hidden” process or some unknown parameters. We will denote
these quantities by the vector $s$.

The next element in the suggested framework is the definition of a stochastic process,
denoted $x$, that depends on the desired quantities in a simple way, i.e.

$$x = \mathcal{F}(s) \tag{7.1}$$

where $\mathcal{F}$ is a stochastic function, which defines a (simple) conditional probability, $f_{X/S}(x/s)$.
If $x$ is observed, $s$ may be estimated easily. We admit that the statistical formulation is
defined having the EM idea in mind; this stochastic process will be used as the complete
data or as part of the complete data in the suggested EM algorithm.

Another  element  is  the  measurement  procedure.  In  defining  this  element,  we  will  model 
the  partial  information  aspect  and  the  measurement  noise  aspect  of  the  real  reconstruction 
problem. Denote the observation by $y$. We may write,

$$y = H(x) + v \tag{7.2}$$

where $H(\cdot)$ is a non-invertible transformation representing the fact that only partial information
is available. The vector $v$ represents the measurement errors. With a probabilistic
description of $v$, we achieve a stochastic description of the observations in terms of the desired
quantities, via the conditional probability $f_{Y/S}(y/s)$.

Using the stochastic description of the observations and an appropriate statistical criterion,
determining the desired quantities reduces to solving a mathematical optimization
problem. If this problem is ill-posed, then additional information or assumptions and maybe
a different statistical criterion should be considered. In the statistical framework, the a-priori
knowledge about the desired quantities can be easily incorporated, in a quantitative
manner. The a-priori knowledge will also define the statistical criterion which will be used.
This element of the statistical formulation is important, since, by incorporating additional
information, it is possible to solve real reconstruction problems, which are ill-posed otherwise.

We will now present three examples of different statistical models that follow the descriptions
above. All three models may be used for solving reconstruction problems. In the
first example, the desired quantities, $s$, are the samples of the signal or the pixels of the
image to be reconstructed. Assuming a simple stochastic model, $x$ is distributed normally
with mean $F s$ and covariance $\sigma^2 I$, where $F$ is an invertible linear transformation, say the
Fourier transform. Thus, we may write

$$x = F s + n \tag{7.3}$$

where $n$ is a discrete white noise signal. Suppose, for example, that we want to reconstruct
the signal from the magnitude of its Fourier transform. The measurements in this case will
be modeled as,

$$[y]_i = |[x]_i| \tag{7.4}$$

where $[\cdot]_i$ denotes the $i$th component of a vector.

The second modeling example comes from Radioactive and Positron Emission Tomography
medical imaging problems. At each point $i$ in (the discrete) space, there is a radiating
Poisson source, with parameter $\lambda_i$. The desired quantities in this model, denoted $\lambda$, will be
these parameters. With a perfect imaging system, we may observe the vector $x$, where its
$i$th component is the number of photons or particles emitted by the source in the $i$th point,
i.e. we may write,

$$f(x/\lambda) = \prod_i e^{-\lambda_i} \frac{\lambda_i^{x_i}}{x_i!} \tag{7.5}$$

Our imaging system is not perfect, however. In tomography, for example, we measure noisy
projections of $x$, which may be modeled as,

$$y = H x + v \tag{7.6}$$

where $H$, the projection operator, is a non-invertible linear transformation. We note that, in
this case, the relation between complete and incomplete data is linear. However, we cannot
use the result of the Linear Gaussian case here, since $x$ has a Poisson distribution. Statistical
models similar to the model above were suggested in the medical tomography context in [86],
[35] and elsewhere.

The third example is as follows. Modeling images using Markov Random Fields (MRF)
has recently become increasingly popular. Using interesting simulation algorithms (e.g.
[87], [88], [89], [90]), realisations of MRF with various parameters were generated. These
samples resembled realistic images surprisingly well. We strongly suggest reading [88] to
see how realistic images with different characteristics can emerge by the innovative choice
of the parameters of the MRF. Consider now the following statistical framework of the
signal reconstruction problem. Let the desired quantities be the parameters of the MRF,
denoted $\theta$. The process $x$ will be the image, i.e. it depends stochastically on $\theta$, via the
Gibbs distribution, see e.g. [89],

$$f_X(x/\theta) = \frac{1}{Z} e^{-U(x;\theta)/T} \tag{7.7}$$

where $Z$ is the normalization factor, and $U$ is called the “energy function”, where the
neighborhood structure and the characteristics of the image are defined. The observations, $y$,
will be a noisy and incomplete function of $x$, as in (7.2). The EM algorithm suggested in
this formulation is directed at finding $\theta$. However, as a possible by-product in the E step,
an estimate of $x$, the image itself, will be available.
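
To make the Gibbs distribution concrete, the sketch below draws a realisation of a very simple MRF. The choices here are ours and are not taken from the cited references: a binary image with pixel values in {-1, +1}, a four-neighbour structure, and an Ising-type energy $U(x) = -\beta \sum x_i x_j$ over neighbouring pairs, so that the single parameter is $\beta$. The function names are hypothetical. Each pixel is resampled in turn from its conditional distribution given its neighbours, which is the Gibbs sampler idea.

```python
import math
import random


def energy(img, beta):
    """Ising-type energy: -beta times the sum of x_i * x_j over 4-neighbour pairs."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                total += img[i][j] * img[i + 1][j]
            if j + 1 < cols:
                total += img[i][j] * img[i][j + 1]
    return -beta * total


def gibbs_sample(rows, cols, beta, temperature=1.0, sweeps=30, seed=0):
    """Draw an approximate sample from exp(-U(x)/T)/Z by sweeping over the pixels."""
    rng = random.Random(seed)
    img = [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                m = 0  # sum of the neighbouring pixel values
                if i > 0:
                    m += img[i - 1][j]
                if i + 1 < rows:
                    m += img[i + 1][j]
                if j > 0:
                    m += img[i][j - 1]
                if j + 1 < cols:
                    m += img[i][j + 1]
                # Conditional probability that this pixel equals +1.
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * m / temperature))
                img[i][j] = 1 if rng.random() < p_plus else -1
    return img


def neighbour_agreement(img):
    """Fraction of 4-neighbour pairs whose two pixels are equal."""
    rows, cols = len(img), len(img[0])
    same = pairs = 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                pairs += 1
                same += img[i][j] == img[i + 1][j]
            if j + 1 < cols:
                pairs += 1
                same += img[i][j] == img[i][j + 1]
    return same / pairs
```

Larger values of `beta` favour images made of large uniform regions, while `beta = 0` gives independent pixels; this is the sense in which the parameters of the MRF control the character of the generated image.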

7.2.3  Solution  using  the  EM  method 

Solving  the  mathematical  problems,  generated  by  the  statistical  formulation,  directly 
is  generally  complicated.  However,  since  the  statistical  framework  was  suggested  with  the 
EM  ideas  in  mind,  EM  iterative  algorithms  can  be  naturally  applied.  The  complete  data 
will be the set of signals $\{x, y\}$, i.e. it will include the undistorted signal $x$ in addition to the
observations $y$. For the cases where the measurement noise $v$ does not exist, the complete
data will be the undistorted signal $x$.

In the E step of the suggested algorithms, the conditional expectation of the sufficient
statistics of $x$ is calculated, given $y$ and the current value of $s$. For example, in the first
statistical formulation above, if $x$ is Gaussian with mean $F s$, the sufficient statistic is linear in $x$;
thus, this conditional expectation may be easily derived. For problems such as reconstruction
from Short Time Fourier Transforms, band limited extrapolation, reconstruction from
projection and others, the observations, $y$, are related to $x$ by a non-invertible linear transformation,
so we may use the results developed for the Linear Gaussian case.
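
For orientation, the conditional expectation needed in such a linear Gaussian E step has the familiar closed form sketched below. The notation is assumed for this illustration only and may differ from that used in the derivation of the Linear Gaussian case: $x$ is Gaussian with mean $m_x$ and covariance $\Lambda_x$, and $y = Hx + v$ with $v$ Gaussian, zero mean, covariance $\Lambda_v$, independent of $x$.

```latex
\begin{align*}
E\{x / y\} &= m_x + \Lambda_x H^T \left( H \Lambda_x H^T + \Lambda_v \right)^{-1} (y - H m_x) \\
\mathrm{Cov}\{x / y\} &= \Lambda_x - \Lambda_x H^T \left( H \Lambda_x H^T + \Lambda_v \right)^{-1} H \Lambda_x
\end{align*}
```

In the first statistical formulation one would take $m_x = F s^{(k)}$, with $s^{(k)}$ the current estimate, and $\Lambda_x = \sigma^2 I$. In the noiseless case, $\Lambda_v = 0$, the expression remains valid provided $H \Lambda_x H^T$ is invertible.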

The M step will be simple, since the process $x$ depends on the unknown quantities in
a direct way. For example, in the second statistical formulation of (7.5) and (7.6), we may
estimate the desired quantities $\lambda_i$, having the complete data $\{x_i\}$, as

$$\hat{\lambda}_i = \arg\max_{\lambda_i} \left[ -\lambda_i + x_i \log \lambda_i \right] = x_i \tag{7.8}$$
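
The maximization in (7.8) can be checked in one line. Writing $\lambda_i$ for the Poisson parameter and $x_i$ for the complete data count at point $i$, and denoting by $\ell$ (our notation) the part of the complete data log-likelihood that depends on $\lambda_i$:

```latex
\begin{align*}
\ell(\lambda_i) &= -\lambda_i + x_i \log \lambda_i \\
\frac{d\ell}{d\lambda_i} &= -1 + \frac{x_i}{\lambda_i} = 0
  \;\Rightarrow\; \hat{\lambda}_i = x_i \\
\frac{d^2\ell}{d\lambda_i^2} &= -\frac{x_i}{\lambda_i^2} < 0 \qquad (x_i > 0)
\end{align*}
```

Since $\ell$ is linear in $x_i$, within an EM iteration the unobserved count $x_i$ would simply be replaced by its conditional expectation computed in the E step.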

To  fix  ideas,  we  will  present  explicitly  the  EM  iterative  algorithm  for  maximum  likelihood 
signal  reconstruction  from  the  magnitude  of  its  Fourier  transform,  in  the  first  statistical 
framework of (7.3) and (7.4). We note that a similar algorithm may be found in [8], pp. 344-346.

We assume that the signal is real and has a given finite support. This a-priori knowledge
is incorporated in the form of deterministic constraints. We will denote the signal to be
reconstructed by $s(n)$ and its Fourier transform by $S(\omega)$. The complete data is given in the
frequency domain by,

$$X(\omega) = S(\omega) + N(\omega) \tag{7.9}$$

where $N(\omega)$ is a complex Gaussian random variable with variance $\sigma^2$. The observations are
$Y(\omega) = |X(\omega)|$.

The  E  and  M  steps  of  the  EM  algorithm  in  the  kth  iteration  are  given  by, 


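•  The E step: Given $s^{(k)}(n)$ or $S^{(k)}(\omega) = |S^{(k)}(\omega)| e^{j\phi_k(\omega)}$ find,

$$X^{(k)}(\omega) = E\{X(\omega) / |X(\omega)| = Y(\omega), S^{(k)}(\omega)\}$$

$$= Y(\omega) e^{j\phi_k(\omega)} \frac{\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{(Y(\omega)|S^{(k)}(\omega)|/\sigma^2)\cos\theta} \cos\theta \, d\theta}{I_0(Y(\omega)|S^{(k)}(\omega)|/\sigma^2)} = Y(\omega) e^{j\phi_k(\omega)} \frac{I_1(Y(\omega)|S^{(k)}(\omega)|/\sigma^2)}{I_0(Y(\omega)|S^{(k)}(\omega)|/\sigma^2)} \tag{7.10}$$

where $I_0(\cdot)$ is the zero order modified Bessel function and $I_1(\cdot)$ is the first order
modified Bessel function.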


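•  The M step: Let $x^{(k)}(n)$ be the inverse Fourier transform of $X^{(k)}(\omega)$; the updated
estimates $s^{(k+1)}(n)$ are,

$$s^{(k+1)}(n) = \begin{cases} \mathrm{Re}[x^{(k)}(n)] & \text{if } n \text{ is in the signal support} \\ 0 & \text{otherwise} \end{cases} \tag{7.11}$$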


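We note that since

$$\lim_{x \to \infty} \frac{I_1(x)}{I_0(x)} = 1$$

as the variance $\sigma^2$ tends to zero, the E step becomes

$$X^{(k)}(\omega) = Y(\omega) e^{j\phi_k(\omega)} \tag{7.12}$$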


i.e.  the  complete  data  is  estimated  by  combining  the  given  magnitude  with  the  current 
estimate of the phase. This algorithm was suggested in [79] and [82].

From  this  discussion,  we  gain  a  new  interpretation  of  previously  suggested  algorithms. 
However, we also conclude that the problem of reconstructing the signal from the magnitude
of its Fourier transform cannot be solved by maximizing the likelihood in this framework, since
this leads us back to previous algorithms, which perform poorly.

As we repeatedly mentioned, for ill-posed problems more information should be incorporated.
In the statistical framework, the information can be easily incorporated. Depending
on this information, the adequate statistical criterion will be used. The resulting statistical
problem for, say, the reconstruction from magnitude problem, will be similar to (7.10) and
(7.11), where the E step will be the same; however, in the M step, a different criterion will
be invoked.
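
One iteration of the ML algorithm of (7.10) and (7.11) may be sketched in code as follows. This is an illustration under our own assumptions, not a reference implementation: the signal is a short list of real samples, a naive discrete Fourier transform stands in for $S(\omega)$, the Bessel ratio is evaluated from its integral form (with the standard large-argument expansion beyond an arbitrary threshold of 100), and the function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(n)) / n
            for t in range(n)]


def bessel_ratio(z, steps=256):
    """Return I1(z) / I0(z) for z >= 0."""
    if z > 100.0:
        # Large-argument expansion; avoids underflow of the integrand below.
        return 1.0 - 1.0 / (2.0 * z) - 1.0 / (8.0 * z * z)
    num = den = 0.0
    for i in range(steps):
        theta = math.pi * (i + 0.5) / steps               # midpoint rule on [0, pi]
        weight = math.exp(z * (math.cos(theta) - 1.0))    # integrand scaled by exp(-z)
        num += weight * math.cos(theta)
        den += weight
    return num / den


def em_iteration(s, magnitude, support, sigma2):
    """One EM iteration: E step as in (7.10), M step as in (7.11)."""
    spectrum = dft(s)
    completed = []
    for s_f, y_f in zip(spectrum, magnitude):
        mag = abs(s_f)
        phase = s_f / mag if mag > 0.0 else 1.0   # current phase estimate
        completed.append(y_f * phase * bessel_ratio(y_f * mag / sigma2))
    x = idft(completed)
    # Keep the real part inside the known support, zero elsewhere.
    return [x[n].real if n in support else 0.0 for n in range(len(s))]
```

Calling `em_iteration` repeatedly, each time feeding back the returned samples, gives the iterative algorithm. With a very small `sigma2` the Bessel ratio is essentially one, and each iteration reduces to imposing the given magnitude on the current phase estimate and then enforcing the support constraint, in line with the limit (7.12).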


7.3  Summary  and  discussion 


The  thesis  may  be  summarized  as  follows.  We  have  solved  signal  processing  problems 
using a class of iterative estimation algorithms. This class of algorithms is based on the
EM method suggested in [2]. We, however, have extended this class and developed several
general  theoretical  results.  We  will  discuss  the  contributions  made  in  this  thesis  in  these 
two  levels,  the  signal  processing  applications  level  and  the  theoretical  contributions  level. 

7.3.1  The  signal  processing  applications 

When  we  discuss  the  application  of  the  EM  algorithm  to  a  real  world  problem,  we 
first  have  to  model  the  problem  statistically  and  then  apply  the  EM  algorithm  to  solve 
the resulting statistical problem. However, the EM algorithm is not uniquely defined; it
depends  on  the  choice  of  complete  data,  and,  as  we  have  seen,  an  unfortunate  choice  yields 
a  completely  useless  algorithm.  The  choice  of  complete  data  or  equivalently  the  choice  of  a 
specific  EM  algorithm  requires  creativity,  in  order  to  get  a  practically  useful  algorithm. 

As  a  general  philosophy,  we  will  have  the  EM  algorithm  in  mind,  while  suggesting 
a  statistical  model  to  the  real  signal  processing  problem.  Using  this  philosophy,  we  will 
identify  what  the  desired  measurements  are,  model  them  statistically,  and  find  their  relation 
to  the  given  observations.  The  statistical  problem,  generated  this  way,  can  then  be  solved 
using  the  EM  algorithm;  the  desired  measurements  will  be  chosen,  naturally,  to  be  the 
complete  data. 

The  main  contribution  of  this  thesis  is  the  explicit  solution  of  the  important  signal 
processing problems presented in Chapters 4 and 5. As we recall, these real problems are:

•  Parameter estimation of superimposed signals. Applications of this model that have
been addressed are:


-  Multiple  source  location  (or  bearing)  estimation 

-  Multipath  or  multi-echo  time  delay  estimation 

•  Noise  cancellation  in  a  multiple  microphone  environment.  The  application  considered 
is  a  speech  enhancement  problem. 

In  both  cases,  we  have  suggested  solutions  that  improve  upon  the  existing  state  of  the 
art.  In  the  superimposed  signals  application,  the  ML  approach  has  been  formulated  before. 
However,  since  its  solution  is  complicated,  others  have  avoided  it  and  suggested  suboptimal 
or ad-hoc solutions. We have tackled this ML problem and succeeded in presenting a practical
solution to it. In the noise canceling problem, our contributions include the formulation
of  the  statistical  ML  problem  to  model  different  physical  situations.  Using  the  EM  method, 
we  were  able  to  suggest  practical  solutions  to  the  underlying  real  problem. 

We  may  consider  Chapters  4  and  5  as  a  demonstration  of  our  suggested  philosophy  for 
solving  signal  processing  problems.  In  these  chapters,  we  have  demonstrated  this  philosophy 
through  all  stages  of  the  solution,  from  modeling,  through  the  suggestion  of  an  algorithm,  to 
the  numerical  solution.  Thus,  these  chapters  will  serve  as  a  reference  for  further  applications. 

7.3.2  The  theoretical  contributions 

The basic EM method has been suggested in [2]. However, in the process of considering
the applications mentioned above, we have extended and modified the original EM algorithm.
We have also derived explicit forms for some special cases. These extensions and
derivations  made  the  method  more  suitable  for  signal  processing  applications.  We  will  now 
present  and  discuss  these  contributions. 

•  The Linear Gaussian case


We  derived  closed  form  analytical  expressions  for  the  EM  algorithm  for  the  case 
where  the  complete  and  incomplete  data  are  jointly  Gaussian,  related  by  a  linear 
transformation.  We  note  that,  in  general,  a  closed  form  analytical  expression  cannot 
be  obtained  and  that  the  EM  algorithm  may  require  complex  operations  like  multiple 
integration.  In  retrospect,  this  derivation  appears  to  be  a  significant  contribution, 
since  it  covers  a  wide  range  of  applications. 

•  EM algorithms for general estimation criteria

Originally, the EM algorithm was developed and suggested as a technique for maximizing
the likelihood. However, other criteria are more appropriate for some problems.
We have developed EM algorithms for optimizing other criteria, specifically the Minimum
Information criterion. We note that, in Chapters 2 and 6 of the thesis, a general
discussion  on  the  Minimum  Information  criterion,  its  properties  and  its  relations  to 
other  statistical  methods,  is  presented. 

•  Extended  EM:  varying  the  complete  data  in  each  iteration 

As mentioned above, the choice of the complete data may critically affect the complexity
and the rate of convergence of the algorithm. It may also affect the convergence
point,  leading  to  a  different  stationary  point  for  different  choices  of  complete  data.  We 
have  suggested,  in  the  thesis,  an  interesting  alternative  to  a  fixed  choice  of  complete 
data: we suggest varying the definition of the complete data in each step of the algorithm.
This way, we may get simpler schemes, algorithms that converge
faster, or algorithms that may escape from unwanted stationary points.

•  Sequential and Adaptive versions

Sequential and adaptive versions of the EM algorithm have been developed in Chapter 3
and some of their properties have been derived. We have identified sequential
algorithms that are based on problem structures, and we have used the stochastic
approximation idea to derive sequential EM algorithms in the general case. We have
applied these sequential algorithms in a few examples. However, important topics for
further  research  are  the  applications  of  these  sequential  algorithms  to  a  variety  of 
signal  processing  problems  and  a  further  theoretical  analysis  of  these  algorithms. 

As  a  result  of  these  contributions,  a  general  and  flexible  class  of  iterative  estimation 
algorithms has been established. Beyond the theoretical contributions and the specific applications,
we believe that this thesis suggests a way of thinking and a philosophy, which may
be  used  in  a  large  variety  of  seemingly  complex  statistical  inference  and  signal  processing 
problems. 

Appendix A


Convergence theorems of the EM algorithm


The convergence theorems of the EM algorithm are given in this Appendix. The theorems
will be presented in parallel to the discussion of Chapter 2, i.e. we will start with the
convergence  properties  of  the  likelihood  sequence,  then  the  convergence  properties  of  the 
parameter  estimates  sequence  and  we  will  end  by  discussing  the  rate  of  convergence. 


A.1  Convergence of the likelihood sequence

We  start  by  quoting  the  Global  Convergence  Theorem  from  [17]  and  [18].  This  theorem 
is  frequently  used  to  prove  convergence  of  iterative  algorithms  in  numerical  analysis.  Recall 
that a point to set map $M(x)$, where $x \in X$, is called closed at $x$ if

$$x_k \to x, \; x_k \in X \quad \text{and} \quad y_k \to y, \; y_k \in M(x_k) \;\Rightarrow\; y \in M(x)$$

For  a  point  to  point  map,  continuity  implies  closedness. 

Theorem A.1 (Global Convergence Theorem) Let the sequence $\{x_k\}$ be generated by
$x_k \to x_{k+1} \in M(x_k)$, where $M$ is a point to set map. Let a solution set $\Gamma$ be given, and
suppose that:

(i) all points $x_k$ are contained in a compact set $C$
(ii) $M$ is closed over the complement of $\Gamma$
(iii) there is a continuous function $L$ such that

(a) if $x \notin \Gamma$, $L(y) > L(x) \;\; \forall y \in M(x)$

(b) if $x \in \Gamma$, $L(y) \geq L(x) \;\; \forall y \in M(x)$

Then, all the limit points of $\{x_k\}$ are in the solution set $\Gamma$ and $L(x_k)$ converges monotonically
to $L(x)$ for some $x \in \Gamma$.


The proof may be found in [17], page 91 and [18], page 187.

We are interested in applying the theorem above to the case where $L(\cdot)$ is the log-likelihood
function defined over $\Theta$, the solution set is either the set of local maxima $\mathcal{M}$
or the set of stationary points $\mathcal{S}$, and $M(\cdot)$ is the point to set map implied by the EM
(GEM) algorithm. In this case, condition (i) is met by the assumption that $\Theta_0$ is compact,
condition (iii)(b) is true by theorem 2.1, see eq. (2.22); thus, we have the following corollary
of theorem A.1:


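Corollary A.1 Let $\{\theta^{(n)}\}$ be a GEM sequence generated by

$$\theta^{(n)} \to \theta^{(n+1)} \in M(\theta^{(n)})$$

and suppose that

(i) $M$ is a closed point to set map over the complement of $\mathcal{S}$ ($\mathcal{M}$)

(ii) $L(\cdot)$ is continuous and $L(\theta^{(n+1)}) > L(\theta^{(n)})$ for all $\theta^{(n)} \notin \mathcal{S}$ ($\mathcal{M}$)

Then all limit points of $\{\theta^{(n)}\}$ are stationary points (local maxima) of $L$, and $L(\theta^{(n)})$
converges monotonically to $L^* = L(\theta^*)$ for some $\theta^* \in \mathcal{S}$ ($\mathcal{M}$).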

For the EM algorithm, where $M(\theta^{(n)})$ is the set of maximizers of $Q(\theta; \theta^{(n)})$, the following
continuity condition

$$Q(\theta_1; \theta_2) \text{ is continuous in both } \theta_1 \text{ and } \theta_2 \tag{A.1}$$

implies the closedness of $M$, i.e. it implies condition (i) in the corollary above. Now, if
we are interested only in convergence to a stationary point, where the solution set is $\mathcal{S}$,
then the continuity condition also implies condition (ii) above; thus we have the following
theorem:

Theorem A.2 Suppose $Q$ satisfies the continuity condition (A.1). Then all the limit points
of any instance $\{\underline{\theta}^{(n)}\}$ of an EM algorithm are stationary points of $L$ and $L(\underline{\theta}^{(n)})$ converges
monotonically to $L^* = L(\underline{\theta}^*)$ for some stationary point $\underline{\theta}^*$.

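Proof: Suppose that for some $\underline{\theta}^{(n)} \notin \mathcal{S}$ condition (ii) above is not met, i.e.

$L(\underline{\theta}^{(n+1)}) = L(\underline{\theta}^{(n)})$ (A.2)

where $\underline{\theta}^{(n+1)} \in M(\underline{\theta}^{(n)})$, i.e. it is a global maximizer of $Q(\cdot; \underline{\theta}^{(n)})$. Since $\underline{\theta}^{(n)}$ is the global
maximizer of $H(\cdot; \underline{\theta}^{(n)})$ (by Jensen's inequality, eq. (2.14)), the equality in (A.2) implies

$Q(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) = Q(\underline{\theta}^{(n)}; \underline{\theta}^{(n)})$ and $H(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) = H(\underline{\theta}^{(n)}; \underline{\theta}^{(n)})$ (A.3)

or in particular, that $\underline{\theta}^{(n)}$ is also a global maximizer of $Q(\cdot; \underline{\theta}^{(n)})$. Now

$L(\underline{\theta}^{(n)}) = Q(\underline{\theta}^{(n)}; \underline{\theta}^{(n)}) - H(\underline{\theta}^{(n)}; \underline{\theta}^{(n)}) = q(\underline{\theta}^{(n)}) - h(\underline{\theta}^{(n)})$ (A.4)

and $\underline{\theta}^{(n)}$ is a global maximizer of both $q(\cdot)$ and $h(\cdot)$; thus, $\underline{\theta}^{(n)}$ must be a stationary point of $L(\cdot)$,
which contradicts the assumption $\underline{\theta}^{(n)} \notin \mathcal{S}$.

□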
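The monotone behaviour of the likelihood sequence asserted above can be observed numerically. The following is a minimal sketch, assuming a toy problem that does not appear in the thesis: a two-component Gaussian mixture with known equal weights and unit variances, where only the two means are estimated. The names `log_likelihood`, `em_step` and `history` are illustrative.

```python
import math
import random

def log_likelihood(y, mu, w=0.5, sigma=1.0):
    """Log-likelihood L(theta) of a two-component Gaussian mixture with means mu."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for v in y:
        p0 = w * math.exp(-0.5 * ((v - mu[0]) / sigma) ** 2) / norm
        p1 = (1.0 - w) * math.exp(-0.5 * ((v - mu[1]) / sigma) ** 2) / norm
        total += math.log(p0 + p1)
    return total

def em_step(y, mu, w=0.5, sigma=1.0):
    """One EM iteration: the E step computes posterior component
    probabilities, the M step re-estimates the two means."""
    num = [0.0, 0.0]
    den = [0.0, 0.0]
    for v in y:
        g0 = w * math.exp(-0.5 * ((v - mu[0]) / sigma) ** 2)
        g1 = (1.0 - w) * math.exp(-0.5 * ((v - mu[1]) / sigma) ** 2)
        r0 = g0 / (g0 + g1)          # posterior probability of component 0
        num[0] += r0 * v
        den[0] += r0
        num[1] += (1.0 - r0) * v
        den[1] += 1.0 - r0
    return [num[0] / den[0], num[1] / den[1]]

# Synthetic data (assumed for illustration): true means -2 and 3.
random.seed(0)
y = [random.gauss(-2.0, 1.0) for _ in range(200)]
y += [random.gauss(3.0, 1.0) for _ in range(200)]

mu = [-0.5, 0.5]                      # initial estimate theta^(0)
history = [log_likelihood(y, mu)]     # likelihood sequence L(theta^(n))
for _ in range(50):
    mu = em_step(y, mu)
    history.append(log_likelihood(y, mu))
```

Up to floating-point rounding, `history` is non-decreasing, which is the behaviour the theorem guarantees; the sketch says nothing about which stationary point is reached from other starting values.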

The convergence to the set of local maxima, $\mathcal{M}$, is not guaranteed by the conditions above,
since we may find $\underline{\theta}^{(n)}$ outside $\mathcal{M}$, but inside $\mathcal{S}$, for which indeed condition (ii) of corollary
A.1 is not met. The following theorem imposes an additional condition, and thus provides
sufficient conditions for convergence to the set of local maxima $\mathcal{M}$.




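Theorem A.3 Suppose that in addition to the continuity condition (A.1), $Q$ satisfies

$\sup_{\underline{\theta}'} Q(\underline{\theta}'; \underline{\theta}) > Q(\underline{\theta}; \underline{\theta}) \quad \forall \underline{\theta} \in (\mathcal{S} - \mathcal{M})$ (A.5)

where $(\mathcal{S} - \mathcal{M})$ is the difference set $\{\underline{\theta} \in \mathcal{S},\ \underline{\theta} \notin \mathcal{M}\}$.

Then all the limit points of any instance $\{\underline{\theta}^{(n)}\}$ of an EM algorithm are local maxima of
$L$ and $L(\underline{\theta}^{(n)})$ converges monotonically to $L^* = L(\underline{\theta}^*)$ for some local maximum $\underline{\theta}^*$.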

Proof: Condition (A.5) excludes the possibility that condition (ii) of corollary A.1 is
not met by some $\underline{\theta}^{(n)} \in \mathcal{S} - \mathcal{M}$. Theorem A.2 proved that this condition is met for all
$\underline{\theta}^{(n)} \notin \mathcal{S}$, and thus it is met for all $\underline{\theta}^{(n)} \notin \mathcal{M}$. Thus, using corollary A.1, this theorem follows
immediately. □


A.2  Convergence  of  the  parameter  estimate  sequence 

The convergence of the likelihood sequence does not imply the convergence of the
parameter estimate sequence. However, if the likelihood sequence converges to a solution
set  that  contains  a  single  point,  the  convergence  of  the  parameter  sequence  is  guaranteed 
(trivially),  as  stated  in  the  following  theorem: 


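Theorem A.4 Let $\{\underline{\theta}^{(n)}\}$ be an instance of a GEM algorithm, with a corresponding likelihood
sequence $\{L^{(n)}\}$ that converges to some $L^*$ and satisfies conditions (i), (ii) of corollary
A.1. Let the solution set ($\mathcal{S}(L^*)$ or $\mathcal{M}(L^*)$) be the singleton $\{\underline{\theta}^*\}$. Then,

$\underline{\theta}^{(n)} \to \underline{\theta}^*$ as $n \to \infty$.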


An important special case of this theorem is when the likelihood function is unimodal
in $\Theta$. This case is stated in the following corollary of the theorem above:


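Corollary A.3 Suppose that $L(\underline{\theta})$ is unimodal in $\Theta$ with $\underline{\theta}^*$ being the only stationary point
and that $Q(\underline{\theta}'; \underline{\theta})$ is continuous in both $\underline{\theta}'$ and $\underline{\theta}$. Then any EM sequence $\{\underline{\theta}^{(n)}\}$ converges
to the unique maximizer $\underline{\theta}^*$ of $L(\underline{\theta})$.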
The requirement that the solution set is a singleton may be relaxed, if the sequence of
estimates is such that $\|\underline{\theta}^{(n+1)} - \underline{\theta}^{(n)}\| \to 0$ as $n \to \infty$. In this case, $\{\underline{\theta}^{(n)}\}$ will converge, if
the solution set is discrete, as shown in the following theorem. We note that a discrete set
is a set whose only connected components are singletons.


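Theorem A.5 Let $\{\underline{\theta}^{(n)}\}$ be an instance of a GEM algorithm, with a corresponding likelihood
sequence $\{L^{(n)}\}$ that converges to some $L^*$ and satisfies conditions (i), (ii) of corollary
A.1. If $\|\underline{\theta}^{(n+1)} - \underline{\theta}^{(n)}\| \to 0$ as $n \to \infty$, then all the limit points of $\{\underline{\theta}^{(n)}\}$ are in a connected
and compact subset of $\mathcal{S}(L^*)$ (or $\mathcal{M}(L^*)$). In particular, if $\mathcal{S}(L^*)$ (or $\mathcal{M}(L^*)$) is discrete,
then $\underline{\theta}^{(n)}$ converges to some $\underline{\theta}^*$ in $\mathcal{S}(L^*)$ (or $\mathcal{M}(L^*)$).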


Proof: The sequence $\{\underline{\theta}^{(n)}\}$ is bounded (by our assumptions). The set of limit points
of a bounded sequence with $\|\underline{\theta}^{(n+1)} - \underline{\theta}^{(n)}\| \to 0$ as $n \to \infty$ is connected and compact (see,
e.g. theorem 28.1 of [91]). Since all the limit points of $\{\underline{\theta}^{(n)}\}$ are in $\mathcal{S}(L^*)$ (or $\mathcal{M}(L^*)$), the
theorem follows. □


A.3  Rate  of  convergence 


We start by presenting identities for the derivatives of the log-likelihood function, i.e.
$DL(\underline{\theta})$ and $D^2L(\underline{\theta})$, which are needed for calculating the expression for the rate of
convergence of the EM algorithm. To prove those identities, the following well-known results
concerning the score function are needed.

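Let $\Omega$ be a sample space and $f(\omega; \phi)$ be a p.d.f. defined over this space, parameterized
by $\phi$. Let the score function be defined as,

$S(\omega; \phi) = \frac{\partial \log f(\omega; \phi)}{\partial \phi}$

Then,

$E_\phi\{S(\omega; \phi)\} = \int_\Omega \frac{\partial \log f(\omega; \phi)}{\partial \phi} f(\omega; \phi)\, d\omega = 0$ (A.6)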
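and

$Var_\phi\{S(\omega; \phi)\} = \int_\Omega \left[\frac{\partial \log f(\omega; \phi)}{\partial \phi}\right] \left[\frac{\partial \log f(\omega; \phi)}{\partial \phi}\right]^T f(\omega; \phi)\, d\omega = -E_\phi\left\{\frac{\partial^2 \log f(\omega; \phi)}{\partial \phi^2}\right\}$ (A.7)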
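The two score identities can be checked numerically on a simple density. The sketch below assumes, purely for illustration, the exponential density $f(\omega; \phi) = \phi e^{-\phi\omega}$ on $\omega \ge 0$ (not a model used in the thesis); derivatives are taken by finite differences and expectations by a midpoint rule, so the identities hold only up to discretization error. All function names are hypothetical.

```python
import math

def log_f(w, phi):
    """log f(w; phi) for the exponential density phi * exp(-phi * w)."""
    return math.log(phi) - phi * w

def score(w, phi, eps=1e-5):
    """Central-difference approximation of d/dphi log f(w; phi)."""
    return (log_f(w, phi + eps) - log_f(w, phi - eps)) / (2.0 * eps)

def second_derivative(w, phi, eps=1e-4):
    """Central-difference approximation of d^2/dphi^2 log f(w; phi)."""
    return (log_f(w, phi + eps) - 2.0 * log_f(w, phi)
            + log_f(w, phi - eps)) / eps ** 2

def expect(g, phi, upper=40.0, steps=40000):
    """Midpoint-rule approximation of E_phi{g(w)} over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * h
        total += g(w) * math.exp(log_f(w, phi)) * h
    return total

phi = 1.5
mean_score = expect(lambda w: score(w, phi), phi)              # left side of (A.6)
var_score = expect(lambda w: score(w, phi) ** 2, phi)          # variance, since the mean is 0
neg_curv = -expect(lambda w: second_derivative(w, phi), phi)   # right side of (A.7)
```

For this density both `var_score` and `neg_curv` should be close to $1/\phi^2$, and `mean_score` close to zero.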


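Suppose now that the sample space is $X(\underline{y})$ and that $f_{X/Y}(\underline{x}/\underline{y}; \underline{\theta})$ is defined over it.
Equations (A.6) and (A.7) become

$E\left\{ \frac{\partial}{\partial \underline{\theta}} \log f_{X/Y}(\underline{x}/\underline{y}; \underline{\theta}) \,/\, \underline{y}; \underline{\theta} \right\} = D^{10}H(\underline{\theta}; \underline{\theta}) = 0$ (A.8)

$Var\left\{ \frac{\partial}{\partial \underline{\theta}} \log f_{X/Y}(\underline{x}/\underline{y}; \underline{\theta}) \,/\, \underline{y}; \underline{\theta} \right\} = D^{11}H(\underline{\theta}; \underline{\theta}) = -D^{20}H(\underline{\theta}; \underline{\theta})$ (A.9)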

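Differentiating both sides of (2.12) and using (A.8) and (A.9) above, we get the following
identities

$DL(\underline{\theta}) = D^{10}Q(\underline{\theta}; \underline{\theta}) = S(\underline{y}; \underline{\theta})$ (A.10)

$D^2L(\underline{\theta}) = D^{20}Q(\underline{\theta}; \underline{\theta}) - D^{20}H(\underline{\theta}; \underline{\theta}) = D^{20}Q(\underline{\theta}; \underline{\theta}) + D^{11}H(\underline{\theta}; \underline{\theta})$ (A.11)

$D^{11}Q(\underline{\theta}; \underline{\theta}) = D^{11}H(\underline{\theta}; \underline{\theta})$ (A.12)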

The rate of convergence of a class of GEM algorithms is now given in the following
theorem.


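Theorem A.6 Let $\{\underline{\theta}^{(n)}\}$ be a sequence of a GEM algorithm such that

(i) $\underline{\theta}^{(n)} \to \underline{\theta}^*$

(ii) $D^{10}Q(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) = 0$

(iii) $D^{20}Q(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)})$ is negative definite with eigenvalues bounded away from zero

i.e. $\underline{\theta}^{(n+1)}$ is a local maximizer of $Q(\underline{\theta}; \underline{\theta}^{(n)})$. Then, $DL(\underline{\theta}^*) = 0$, $D^{20}Q(\underline{\theta}^*; \underline{\theta}^*)$ is negative
definite, and

$DM(\underline{\theta}^*) = D^{20}H(\underline{\theta}^*; \underline{\theta}^*) \left[D^{20}Q(\underline{\theta}^*; \underline{\theta}^*)\right]^{-1}$ (A.13)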


Proof: Differentiating (2.12) we get

$DL(\underline{\theta}^{(n+1)}) = D^{10}Q(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) - D^{10}H(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)})$ (A.14)

where the first term of (A.14) is zero by the assumptions, the second term is zero in the
limit as $n \to \infty$ by (A.8), and thus $DL(\underline{\theta}^*) = 0$. Similarly, $D^{20}Q(\underline{\theta}^*; \underline{\theta}^*)$ is negative definite, since
it is the limit of $D^{20}Q(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)})$ whose eigenvalues are bounded away from zero.

The last part of the theorem, i.e. showing (A.13), was proved in Chapter 2 using the
identities above, see (2.42)-(2.44). □
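As a numerical illustration of the rate expression $DM = D^{20}H\,[D^{20}Q]^{-1}$, consider a toy model that is assumed here and is not taken from the thesis: `n_total` i.i.d. unit-variance Gaussian samples with unknown mean, of which only the values in `observed` are available. In this model $D^{20}Q = -N$ and $D^{20}H = -(N - m)$, where $N$ is the complete sample size and $m$ the number of observed samples, so the predicted rate is the fraction of missing data. The names below are illustrative.

```python
def em_update(theta, observed, n_total):
    """One EM iteration for the mean of a unit-variance Gaussian when
    n_total - len(observed) samples are missing.
    E step: each missing sample is replaced by its conditional mean theta.
    M step: the complete-data sample mean."""
    n_missing = n_total - len(observed)
    return (sum(observed) + n_missing * theta) / n_total

observed = [1.2, 0.7, 2.1, 1.6]     # assumed data, m = 4
n_total = 10                         # complete-data size, N = 10
theta_star = sum(observed) / len(observed)   # fixed point of the EM map

theta = 5.0
ratios = []                          # observed contraction factors
for _ in range(20):
    new_theta = em_update(theta, observed, n_total)
    ratios.append((new_theta - theta_star) / (theta - theta_star))
    theta = new_theta

# Second derivatives for this toy model (derived by hand, see lead-in).
d20_H = -(n_total - len(observed))
d20_Q = -n_total
predicted_rate = d20_H / d20_Q       # D^20 H [D^20 Q]^-1
```

Because the EM map is linear in this example, every entry of `ratios` matches `predicted_rate` up to rounding; in general the match holds only asymptotically near the limit point.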
Appendix B


Consistency  of  sequential  EM 
algorithms 


We will start by presenting the following theorem, which is also used for showing
consistency of the ML estimator:


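Theorem B.1 Let $\underline{y}_1, \cdots, \underline{y}_n, \cdots$ be the output of a stationary Markov source with a finite
memory $p$, i.e.

$f_{Y_n/Y_{n-1},\cdots,Y_1}(\underline{y}_n/\underline{y}_{n-1},\cdots,\underline{y}_1; \underline{\theta}) = f_{Y_n/Y_{n-1},\cdots,Y_{n-p}}(\underline{y}_n/\underline{y}_{n-1},\cdots,\underline{y}_{n-p}; \underline{\theta})$

Let $L_n(\underline{\theta})$ be the log-likelihood function given $\underline{y}_1, \cdots, \underline{y}_n$ as in (3.3). The sequence of
functions $l_n(\underline{\theta}) = \frac{1}{n} L_n(\underline{\theta})$ converges uniformly in probability 1 to a limit $l(\underline{\theta})$ where, under
regularity conditions, the global maximum of $l(\underline{\theta})$ and the unique solution to the equation $Dl(\underline{\theta}) = 0$
is the true parameter value $\underline{\theta}_{true}$.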


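Proof: The likelihood $L_n(\underline{\theta})$ may be written as,

$L_n(\underline{\theta}) = \log f_{Y_1,\cdots,Y_n}(\underline{y}_1,\cdots,\underline{y}_n; \underline{\theta})$ (B.1)

$\quad = \log f_{Y_n/Y_{n-1},\cdots,Y_1}(\underline{y}_n/\underline{y}_{n-1},\cdots,\underline{y}_1; \underline{\theta}) + \log f_{Y_{n-1}/Y_{n-2},\cdots,Y_1}(\underline{y}_{n-1}/\underline{y}_{n-2},\cdots,\underline{y}_1; \underline{\theta}) + \cdots$

For $n \gg p$,

$l_n(\underline{\theta}) = \frac{1}{n} L_n(\underline{\theta}) \approx \frac{1}{n} \sum_{i=1}^{n} \log f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta})$ (B.2)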


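Using the strong law of large numbers, we get

$\lim_{n \to \infty} l_n(\underline{\theta}) = E\left\{\log f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta})\right\} = l(\underline{\theta})$ (B.3)

in probability 1.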

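The function $l(\underline{\theta})$ may be written explicitly as,

$l(\underline{\theta}) = \int \log f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta})\, f_{Y_i,\cdots,Y_{i-p}}(\underline{y}_i,\cdots,\underline{y}_{i-p}; \underline{\theta}_{true})\, d\underline{y}_i \cdots d\underline{y}_{i-p}$ (B.4)

or,

$l(\underline{\theta}) = \int d\underline{y}_{i-1} \cdots d\underline{y}_{i-p} \Big[ f_{Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta}_{true}) \cdot$
$\qquad \cdot \int \log f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta})\, f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta}_{true})\, d\underline{y}_i \Big]$ (B.5)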

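Invoking Jensen's inequality on the inner integral, we conclude that

$l(\underline{\theta}) \le l(\underline{\theta}_{true})$ (B.6)

where equality is achieved if and only if,

$f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta}) = f_{Y_i/Y_{i-1},\cdots,Y_{i-p}}(\underline{y}_i/\underline{y}_{i-1},\cdots,\underline{y}_{i-p}; \underline{\theta}_{true})$ a.e. (B.7)

Under the identifiability condition, the equality in (B.7) is achieved only if $\underline{\theta} = \underline{\theta}_{true}$,
i.e. $\underline{\theta}_{true}$ is the unique global maximum of $l(\underline{\theta})$. Using the differentiability condition, and
the convexity of $l(\underline{\theta})$, we also conclude that $\underline{\theta}_{true}$ is the unique solution to the equation
$Dl(\underline{\theta}) = 0$. □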
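The Jensen step used in the proof above can be written out in one line. The following is a sketch under the regularity conditions of the theorem; the shorthand $f_{\theta}(\underline{y}_i/\cdot)$ for the conditional density of $\underline{y}_i$ given its $p$ predecessors is introduced here only for brevity and is not the thesis notation. The outer expectation is over the $p$ past samples under $\underline{\theta}_{true}$.

```latex
\begin{align*}
l(\underline{\theta}) - l(\underline{\theta}_{true})
  &= E\left\{ \int \log
       \frac{f_{\theta}(\underline{y}_i/\cdot)}{f_{\theta_{true}}(\underline{y}_i/\cdot)}
       \, f_{\theta_{true}}(\underline{y}_i/\cdot)\, d\underline{y}_i \right\} \\
  &\le E\left\{ \log \int
       \frac{f_{\theta}(\underline{y}_i/\cdot)}{f_{\theta_{true}}(\underline{y}_i/\cdot)}
       \, f_{\theta_{true}}(\underline{y}_i/\cdot)\, d\underline{y}_i \right\}
   = E\left\{ \log \int f_{\theta}(\underline{y}_i/\cdot)\, d\underline{y}_i \right\}
   = E\{\log 1\} = 0
\end{align*}
```

Since the logarithm is strictly concave, the inequality is an equality only when the density ratio is constant almost everywhere, and a ratio of two densities can be constant only if it equals one.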
This theorem may be extended to more general ergodic sources, whose memory is fading
fast enough. The more general conditions may be found in [48], appendix A.

Using  this  theorem,  we  may  now  state  the  main  consistency  result: 


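Theorem B.2 Let the observations $\underline{y}_1, \cdots, \underline{y}_n, \cdots$ be generated by an ergodic source for
which theorem B.1 holds. Let $\{\underline{\theta}^{(n)}\}$ be an instance of a sequential EM algorithm such that
for any realization of the observations,

(i) the sequence of estimates converges to a limit $\underline{\theta}^*$

(ii) $\lim_{n \to \infty} D^{10}Q_{n+1}(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) = 0$

Then, in probability 1, as $n \to \infty$, $\underline{\theta}^{(n)} \to \underline{\theta}_{true}$.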

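Proof: From the assumption (i), and using the identity (A.8), we may write,

$\lim_{n \to \infty} D^{10}H_{n+1}(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) = \lim_{n \to \infty} D^{10}H_{n+1}(\underline{\theta}^*; \underline{\theta}^*) = 0$ (B.8)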

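From theorem B.1 the sequence $l_n(\underline{\theta}) = \frac{1}{n} L_n(\underline{\theta})$ converges uniformly in probability 1
to some $l(\underline{\theta})$. The sequence of derivatives may be written as,

$Dl_{n+1}(\underline{\theta}^{(n+1)}) = \frac{1}{n+1} \left[ D^{10}Q_{n+1}(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) - D^{10}H_{n+1}(\underline{\theta}^{(n+1)}; \underline{\theta}^{(n)}) \right]$ (B.9)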

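Thus, from the assumption (ii), and from (B.8) above,

$\lim_{n \to \infty} Dl_{n+1}(\underline{\theta}^{(n+1)}) = 0$ (B.10)

Since $l_n(\underline{\theta})$ converges uniformly to $l(\underline{\theta})$, and using (B.10) we conclude that

$\lim_{n \to \infty} Dl(\underline{\theta}^{(n+1)}) = 0$ (B.11)

From theorem B.1 and from (B.11), our desired result follows, i.e.

$\underline{\theta}^{(n)} \to \underline{\theta}_{true}$

in probability 1. □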
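The consistency statement can be illustrated with a simple recursive estimator in the spirit of a sequential EM algorithm. This is only a sketch under stated assumptions: the model (a two-component Gaussian mixture with known means 0 and 4, unit variances and an unknown mixing weight), the i.i.d. source and the $1/(n+1)$ gain are all chosen here for illustration and are not the recursion developed in the thesis. The function names are hypothetical.

```python
import math
import random

def gauss_pdf(v, mean):
    """Unit-variance Gaussian density evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2) / math.sqrt(2.0 * math.pi)

def sequential_em_weight(samples, w0=0.5, means=(0.0, 4.0)):
    """Recursive estimate of the weight w of the component with mean means[1].
    Each new sample triggers an E step (its posterior membership
    probability under the current estimate) followed by a recursive
    M step with a decreasing gain."""
    w = w0
    for n, v in enumerate(samples, start=1):
        a = w * gauss_pdf(v, means[1])
        b = (1.0 - w) * gauss_pdf(v, means[0])
        r = a / (a + b)           # E step for the newest sample only
        w += (r - w) / (n + 1)    # recursive M step, gain 1/(n+1)
    return w

# Synthetic i.i.d. observations (assumed): true weight 0.3.
random.seed(1)
w_true = 0.3
samples = [
    random.gauss(4.0, 1.0) if random.random() < w_true else random.gauss(0.0, 1.0)
    for _ in range(20000)
]
w_hat = sequential_em_weight(samples)
```

With this many samples `w_hat` should lie close to `w_true`; the update keeps the estimate inside (0, 1) because each step is a convex combination of the previous estimate and a posterior probability.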